Polar Codes are Optimal for Lossy Source Coding

Satish Babu Korada and Rudiger Urbanke 




Abstract — We consider lossy source compression of a binary 
symmetric source using polar codes and the low-complexity 
successive encoding algorithm. It was recently shown by Arikan 
that polar codes achieve the capacity of arbitrary symmetric 
binary-input discrete memoryless channels under a successive 
decoding strategy. We show the equivalent result for lossy source 
compression, i.e., we show that this combination achieves the 
rate-distortion bound for a binary symmetric source. We further 
show the optimality of polar codes for various problems including 
the binary Wyner-Ziv and the binary Gelfand-Pinsker problem. 



I. Introduction 

Lossy source compression is one of the fundamental problems
of information theory. Consider a binary symmetric source
(BSS) $Y$. Let $d(\cdot,\cdot)$ denote the Hamming distortion
function,

$$d(0,0) = d(1,1) = 0, \qquad d(0,1) = d(1,0) = 1.$$

It is well known that in order to compress $Y$ with average
distortion $D$ the rate $R$ has to be at least $R(D) = 1 - h_2(D)$,
where $h_2(\cdot)$ is the binary entropy function [1], [2, Theorem
10.3.1]. Shannon's proof of this rate-distortion bound is based
on a random coding argument.
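As a quick numerical companion to the bound $R(D) = 1 - h_2(D)$, the following minimal sketch evaluates the binary entropy function and the rate-distortion function of a BSS. The function names `h2` and `rate_distortion_bss` are illustrative choices, not names used in the paper.

```python
import math


def h2(p: float) -> float:
    """Binary entropy in bits, with the convention h2(0) = h2(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def rate_distortion_bss(d: float) -> float:
    """R(D) = 1 - h2(D) for a binary symmetric source, 0 <= D <= 1/2."""
    if not 0.0 <= d <= 0.5:
        raise ValueError("D must lie in [0, 1/2]")
    return 1.0 - h2(d)
```

For instance, `rate_distortion_bss(0.11)` is very close to one half bit per source symbol, while `rate_distortion_bss(0.5)` is zero because a distortion of 1/2 can be reached without transmitting anything.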

It was shown by Goblick that in fact linear codes are 
sufficient to achieve the rate-distortion bound [3], [4, Section 
6.2.3]. 

Trellis based quantizers [5] were perhaps the first "practical" 
solution to source compression. Their encoding complexity is 
linear in the blocklength of the code (Viterbi algorithm). For 
any rate strictly larger than $R(D)$ the gap between the expected
distortion and the design distortion D vanishes exponentially 
in the constraint length. However, the complexity of the en- 
coding algorithm also scales exponentially with the constraint 
length. 

Given the success of sparse graph codes combined with low- 
complexity message-passing algorithms for the channel coding 
problem, it is interesting to investigate the performance of such 
a combination for lossy source compression. 

As a first question, we can ask if the codes themselves are
suitable for the task. In this respect, Matsunaga and Yamamoto
[6] showed that if the degrees of a low-density parity-check
(LDPC) ensemble are chosen as large as $\Theta(\log(N))$, where
$N$ is the blocklength, then this ensemble saturates the
rate-distortion bound if optimal encoding is employed. Even more
promising, Martinian and Wainwright [7] proved that properly
chosen MN codes with bounded degrees are sufficient to
achieve the rate-distortion bound under optimal encoding.

Much less is known about the performance of sparse graph
codes under message-passing encoding. In [8] the authors
consider binary erasure quantization, the source-compression
equivalent of the binary erasure channel (BEC) coding problem.
They show that LDPC-based quantizers fail if the parity
check density is $o(\log(N))$ but that properly constructed
low-density generator-matrix (LDGM) based quantizers combined
with message-passing encoders are optimal. They exploit the
close relationship between the channel coding problem and the
lossy source compression problem, together with the fact that
LDPC codes achieve the capacity of the BEC under
message-passing decoding, to prove the latter claim.

Regular LDGM codes were considered in [9]. Using
non-rigorous methods from statistical physics it was shown that
these codes approach the rate-distortion bound for large degrees.
It was empirically shown that these codes have good
performance under a variant of the belief propagation algorithm
(reinforced belief propagation). In [10] the authors consider
check-regular LDGM codes and show using non-rigorous
methods that these codes approach the rate-distortion bound
for large check degree. Moreover, for any rate strictly larger
than $R(D)$, the gap between the achieved distortion and
$D$ vanishes exponentially in the check degree. They also
observe that belief propagation inspired decimation (BID) 
algorithms do not perform well in this context. In [11], survey 
propagation inspired decimation (SID) was proposed as an 
iterative algorithm for finding the solutions of K-SAT (non- 
linear constraints) formulae efficiently. Based on this success, 
the authors in [10] replaced the parity-check nodes with non- 
linear constraints, and empirically showed that using SID one 
can achieve a performance close to the rate-distortion bound. 

The construction in [8] suggests that those LDGM codes 
whose duals (LDPC) are optimized for the binary symmet- 
ric channel (BSC) might be good candidates for the lossy 
compression of a BSS using message-passing encoding. In 
[12] the authors consider such LDGM codes and empirically 
show that by using SID one can approach very close to the 
rate-distortion bound. They also mention that even BID works 
well but that it is not as good as SID. Recently, in [13] it was 
experimentally shown that using BID it is possible to approach 
the rate-distortion bound closely. The key to making basic BP 
work well in this context is to choose the code properly. This 
suggests that in fact the more sophisticated algorithms like 
SID may not even be necessary. 

In [14] the authors consider a different approach. They show
that for any fixed $\gamma, \epsilon > 0$ the rate-distortion pair
$(R(D) + \gamma, D + \epsilon)$ can be achieved with complexity
$C_1(\gamma)\,\epsilon^{-C_2(\gamma)}\,N$. Of course, the complexity
diverges as $\gamma$ and $\epsilon$ are made smaller.
The idea there is to concatenate a small code of rate $R + \gamma$ with
expected distortion $D + \epsilon$. The source sequence is then split
into blocks of size equal to the code. The concentration with
respect to the blocklength implies that under MAP decoding
the probability that the distortion is larger than $D + \epsilon$ vanishes.

Polar codes, introduced by Arikan in [15], are the first
provably capacity achieving codes for arbitrary symmetric
binary-input discrete memoryless channels (B-DMC) with low
encoding and decoding complexity. These codes are naturally
suited for decoding via successive cancellation (SC) [15]. It
was pointed out in [15] that an SC decoder can be implemented
with $\Theta(N \log(N))$ complexity.

We show that polar codes with an SC encoder are also
optimal for lossy source compression. More precisely, we
show that for any design distortion $0 < D < \frac{1}{2}$, and any
$\delta > 0$ and $0 < \beta < \frac{1}{2}$, there exists a sequence of
polar codes of rate at most $R(D) + \delta$ and increasing length $N$
so that their expected distortion is at most $D + O(2^{-(N^{\beta})})$.
Their encoding as well as decoding complexity is $\Theta(N \log(N))$.

II. Introduction to Polar Codes 

Let $W : \{0,1\} \to \mathcal{Y}$ be a binary-input discrete
memoryless channel (B-DMC). Let $I(W) \in [0, 1]$ denote the
mutual information between the input and output of $W$ with
uniform distribution on the inputs, call it the symmetric
mutual information. Clearly, if the channel $W$ is symmetric,
then $I(W)$ is the capacity of $W$. Also, let $Z(W) \in [0, 1]$
denote the Bhattacharyya parameter of $W$, i.e.,
$Z(W) = \sum_{y \in \mathcal{Y}} \sqrt{W(y \mid 0)\,W(y \mid 1)}$.
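The two channel parameters just defined can be evaluated directly for any B-DMC with a finite output alphabet. The sketch below assumes the channel is given as two lists `w0` and `w1` holding $W(y \mid 0)$ and $W(y \mid 1)$ for each output symbol; this representation and the helper names are illustrative assumptions, not taken from the paper.

```python
import math


def symmetric_mutual_information(w0, w1):
    """I(W) in bits for uniform inputs; w0[y] = W(y|0), w1[y] = W(y|1)."""
    total = 0.0
    for p0, p1 in zip(w0, w1):
        q = 0.5 * (p0 + p1)  # output probability under a uniform input
        for p in (p0, p1):
            if p > 0.0:
                total += 0.5 * p * math.log2(p / q)
    return total


def bhattacharyya(w0, w1):
    """Z(W) = sum over y of sqrt(W(y|0) W(y|1))."""
    return sum(math.sqrt(p0 * p1) for p0, p1 in zip(w0, w1))


def bsc(p):
    """Transition lists (W(.|0), W(.|1)) of a binary symmetric channel."""
    return [1.0 - p, p], [p, 1.0 - p]
```

For a BSC($p$) these reduce to $I(W) = 1 - h_2(p)$ and $Z(W) = 2\sqrt{p(1-p)}$, so a nearly noiseless channel has $Z(W)$ close to 0 and a nearly useless one has $Z(W)$ close to 1.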

In the following, an upper case letter $U$ denotes a random
variable and $u$ denotes its realization. Let $\bar U$ denote the
random vector $(U_0, \dots, U_{N-1})$. For any set $F$, $|F|$ denotes
its cardinality. Let $\bar U_F$ denote $(U_{i_1}, \dots, U_{i_{|F|}})$
and let $\bar u_F$ denote $(u_{i_1}, \dots, u_{i_{|F|}})$, where
$\{i_k \in F : i_k < i_{k+1}\}$. Let $U_i^j$ denote the random vector
$(U_i, \dots, U_j)$ and, similarly, $u_i^j$ denotes $(u_i, \dots, u_j)$.
We use the equivalent notation for other random variables like $X$
or $Y$. Let Ber($p$) denote a Bernoulli random variable with
$\Pr(1) = p$.

The polar code construction is based on the following
observation. Let

$$G_2 = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}. \qquad (1)$$

Let $A_n : \{0, \dots, 2^n - 1\} \to \{0, \dots, 2^n - 1\}$ be a
permutation defined by the bit-reversal operation in [15].
Apply the transform $A_n G_2^{\otimes n}$ (where $^{\otimes n}$ denotes
the $n$-th Kronecker power) to a block of $N = 2^n$ bits and transmit
the output through independent copies of a B-DMC $W$ (see
Figure 1). As $n$ grows large, the channels seen by individual
bits (suitably defined in [15]) start polarizing: they approach
either a noiseless channel or a pure-noise channel, where
the fraction of channels becoming noiseless is close to the
symmetric mutual information $I(W)$.
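To make the transform concrete, here is a minimal sketch that builds the $N \times N$ binary matrix of the Kronecker power of $G_2 = [[1,0],[1,1]]$ and applies a bit-reversal permutation. Two points are assumptions of this sketch rather than statements of the paper: the permutation $A_n$ is applied to the rows, and the helper names (`kronecker_power_g2`, `bit_reverse`, `polar_transform_matrix`, `encode`) are hypothetical. The exact convention is fixed in [15].

```python
def kronecker_power_g2(n):
    """n-th Kronecker power of G_2 = [[1, 0], [1, 1]] as a list of rows."""
    g = [[1]]
    for _ in range(n):
        size = len(g)
        # G_2 (x) G = [[G, 0], [G, G]]
        g = [row + [0] * size for row in g] + [row + row for row in g]
    return g


def bit_reverse(i, n):
    """Reverse the n-bit binary expansion of i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def polar_transform_matrix(n):
    """Rows of the Kronecker power, reordered by bit reversal (assumed convention)."""
    g = kronecker_power_g2(n)
    return [g[bit_reverse(i, n)] for i in range(1 << n)]


def encode(u, h):
    """Compute x = u H with arithmetic over GF(2)."""
    size = len(u)
    return [sum(u[i] & h[i][j] for i in range(size)) % 2 for j in range(size)]
```

With this convention the matrix is its own inverse over GF(2), so applying `encode` twice returns the original vector. The matrix form costs $N^2$ operations per block and is meant only for small examples; the butterfly structure described in [15] gives the $N \log N$ implementation.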

In what follows, let $H_n = A_n G_2^{\otimes n}$. Consider a random
vector $\bar U$ that is uniformly distributed over $\{0,1\}^N$. Let
$\bar X = \bar U H_n$, where the multiplication is performed over
GF(2). Let $\bar Y$ be the result of sending the components of
$\bar X$ over the channel $W$. Let $P_{\bar U, \bar X, \bar Y}$ denote
the induced probability distribution on the set
$\{0,1\}^N \times \{0,1\}^N \times \mathcal{Y}^N$. The channel
between $\bar U$ and $\bar Y$ is defined by the transition probabilities

$$P_{\bar Y \mid \bar U}(\bar y \mid \bar u) = \prod_{i=0}^{N-1} W(y_i \mid x_i) = \prod_{i=0}^{N-1} W(y_i \mid (\bar u H_n)_i).$$

Fig. 1. The transform $A_n G_2^{\otimes n}$ is applied to the information
word $\bar U$ and the resulting vector $\bar X$ is transmitted through
the channel $W$. The received word is $\bar Y$.

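Define $W^{(i)} : \{0,1\} \to \mathcal{Y}^N \times \{0,1\}^{i}$ as the
channel with input $u_i$, output $(\bar y, u_0^{i-1})$, and transition
probabilities given by

$$W^{(i)}(\bar y, u_0^{i-1} \mid u_i) \triangleq P(\bar y, u_0^{i-1} \mid u_i) = \sum_{u_{i+1}^{N-1}} \frac{P(\bar y \mid \bar u)\,P(\bar u)}{P(u_i)} = \frac{1}{2^{N-1}} \sum_{u_{i+1}^{N-1}} P_{\bar Y \mid \bar U}(\bar y \mid \bar u). \qquad (2)$$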



Let $Z^{(i)}$ denote the Bhattacharyya parameter of the channel
$W^{(i)}$,

$$Z^{(i)} = \sum_{\bar y,\, u_0^{i-1}} \sqrt{W^{(i)}(\bar y, u_0^{i-1} \mid 0)\, W^{(i)}(\bar y, u_0^{i-1} \mid 1)}. \qquad (3)$$

The SC decoder operates as follows: the bits $U_i$ are decoded
in the order $0$ to $N - 1$. The likelihood of $U_i$ is computed
using the channel law $W^{(i)}(\bar y, \hat u_0^{i-1} \mid u_i)$, where
$\hat u_0^{i-1}$ are the estimates of the bits $U_0^{i-1}$ from the
previous decoding steps.

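In [15] it was shown that the fraction of the channels
$W^{(i)}$ that are approximately noiseless approaches $I(W)$. More
precisely, it was shown that the $\{Z^{(i)}\}$ satisfy

$$\lim_{n \to \infty} \frac{\left|\left\{ i \in \{0, \dots, 2^n - 1\} : Z^{(i)} < 2^{-5n/4} \right\}\right|}{2^n} = I(W). \qquad (4)$$

In [16], the above result was significantly strengthened to

$$\lim_{n \to \infty} \frac{\left|\left\{ i \in \{0, \dots, 2^n - 1\} : Z^{(i)} < 2^{-2^{n\beta}} \right\}\right|}{2^n} = I(W), \qquad (5)$$

which is valid for any $0 < \beta < \frac{1}{2}$.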
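Polarization is easiest to observe numerically on the binary erasure channel, for which [15] shows that the Bhattacharyya parameters obey the exact recursion $Z \mapsto (2Z - Z^2,\, Z^2)$. The following minimal sketch uses that BEC recursion purely as an illustration; it is not the BSC considered later in this paper, where the parameters have to be estimated. The function names are hypothetical and the order of the returned list is not claimed to match the index order $i$ used in the text.

```python
def bec_bhattacharyya(n, eps):
    """All 2**n Bhattacharyya parameters after n polarization steps of a BEC(eps)."""
    z = [eps]
    for _ in range(n):
        # each channel splits into a worse one (2z - z^2) and a better one (z^2)
        z = [v for x in z for v in (2.0 * x - x * x, x * x)]
    return z


def fraction_below(z, threshold):
    """Fraction of indices whose parameter is strictly below the threshold."""
    return sum(1 for x in z if x < threshold) / len(z)
```

For example, `fraction_below(bec_bhattacharyya(n, 0.5), 2 ** (-5 * n / 4))` grows slowly towards $I(W) = 0.5$ as `n` increases, which mirrors the statement of (4) for this particular channel.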

This suggests using these noiseless channels (i.e., those
channels at position $i$ so that $Z^{(i)} < 2^{-2^{n\beta}}$) for transmitting
information while fixing the symbols transmitted through the
remaining channels to a value known to both the sender and
the receiver. Following Arikan, call those components $U_i$
of $\bar U$ which are fixed "frozen" (denote this set of positions as
$F$) and the remaining ones "information" bits. If the channel
$W$ is symmetric we can assume without loss of generality
that the fixed positions are set to 0. In [15] it was shown
that the block error probability of the SC decoder is bounded
by $\sum_{i \in F^c} Z^{(i)}$, which is of order $O(2^{-2^{n\beta}})$ for
our choice.

Since the fraction of approximately noiseless channels tends
to $I(W)$, this scheme achieves the capacity of the underlying
symmetric B-DMC $W$.

In [15] the following alternative interpretation was
mentioned: the above procedure can be seen as transmitting a
codeword of a code defined through its generator matrix as
follows. A polar code of dimension $0 \le k \le 2^n$ is defined by
choosing a subset of the rows of $H_n$ as the generator matrix.
The choice of the generator vectors is based on the values of
$Z^{(i)}$. A polar code is then defined as the set of codewords of
the form $\bar x = \bar u H_n$, where the bits $i \in F$ are fixed to 0. The
well known Reed-Muller codes can be considered as special
cases of polar codes with a particular rule for the choice of
$F$.

Polar codes with SC decoding have an interesting, and as
of yet not fully explored, connection to the recursive decoding of
Reed-Muller codes as proposed by Dumer [17]. The Plotkin 
{u, u + v) construction in Dumer's algorithm plays the role 
of the channel combining and channel splitting for polar 
codes. Perhaps the two most important differences are (i) the 
construction of the code itself (how the frozen vectors are 
chosen), and (ii) the actual decoding algorithm and the order 
in which information bits are decoded. A better understanding 
of this connection might lead to improved decoding algorithms 
for both constructions. 

Fig. 2. Factor graph representation used by the SC decoder. $W(y_i \mid x_i)$
is the initial prior of the variable $X_i$, when $y_i$ is received at the
output of a symmetric B-DMC $W$.



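To summarize, the SC decoder operates as follows.
For each $i$ in the range $0$ till $N - 1$:

(i) If $i \in F$, then set $\hat u_i = 0$.

(ii) If $i \in F^c$, then compute

$$l_i(\bar y, \hat u_0^{i-1}) = \frac{W^{(i)}(\bar y, \hat u_0^{i-1} \mid u_i = 0)}{W^{(i)}(\bar y, \hat u_0^{i-1} \mid u_i = 1)}$$

and set

$$\hat u_i = \begin{cases} 0, & \text{if } l_i > 1, \\ 1, & \text{if } l_i \le 1. \end{cases} \qquad (6)$$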

As explained in [15] using the factor graph representation
shown in Figure 2, the SC decoder can be implemented
with complexity $\Theta(N \log(N))$. A similar representation was
considered for decoding of Reed-Muller codes by Forney in
[18].

A. Decimation and Random Rounding 

In the setting of channel coding there is typically one 
codeword (namely the transmitted one) which has a posterior 
that is significantly larger than all other codewords. This 
makes it possible for a greedy message-passing algorithm to 
successfully move towards this codeword in small steps, using 
at any given moment "local" information provided by the
decoder.

In the case of lossy source compression there are typically 
many codewords that, if chosen, result in similar distortion. 
Let us assume that these "candidates" are roughly uniformly 
spread around the source word to be compressed. It is then 
clear that a local decoder can easily get "confused," producing 
locally conflicting information with regards to the "direction" 
into which one should compress. 

A standard way to overcome this problem is to combine the 
message-passing algorithm with decimation steps. This works
as follows: first run the iterative algorithm for a fixed number
of iterations and subsequently decimate a small fraction of the 
bits. More precisely, this means that for each bit which we 
decide to decimate we choose a value. We then remove the 
decimated variable nodes and adjacent edges from the graph. 
One is hence left with a smaller instance of essentially the 
same problem. The same procedure is then repeated on the 
reduced graph and this cycle is continued until all variables 
have been decimated. 

One can interpret the SC operation as a kind of decimation
where the order of the decimation is fixed in advance
$(0, \dots, N - 1)$. In fact, the SC decoder can be interpreted as
a particular instance of a BID.

When making the decision on bit $U_i$ using the SC decoder,
it is natural to choose that value for $U_i$ which maximizes
the posterior. Indeed, such a scheme works well in practice
for source compression. For the analysis however it is more
convenient to use randomized rounding. In each step, instead
of making the MAP decision we replace (6) with

$$\hat u_i = \begin{cases} 0, & \text{w.p. } \frac{l_i}{1 + l_i}, \\ 1, & \text{w.p. } \frac{1}{1 + l_i}. \end{cases}$$

In words, we make the decision proportional to the likelihoods.
Randomized rounding as a decimation rule is not new. E.g.,
in [19] it was used to analyze the performance of BID for
random K-SAT problems.
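The difference between the MAP rule (6) and randomized rounding fits in a few lines. The sketch below assumes a finite, positive likelihood ratio $l_i$ (which holds for a BSC with crossover probability strictly between 0 and 1/2) and uses hypothetical function names; the random number generator is passed in explicitly so that two encoders can share a common source of randomness.

```python
import random


def map_decision(l):
    """MAP rule: choose 0 if the likelihood ratio exceeds 1, otherwise 1."""
    return 0 if l > 1.0 else 1


def randomized_rounding(l, rng):
    """Return 0 with probability l / (1 + l) and 1 with probability 1 / (1 + l)."""
    p0 = l / (1.0 + l)
    return 0 if rng.random() < p0 else 1
```

With `l = 3.0` the MAP rule always returns 0, whereas `randomized_rounding(3.0, random.Random(0))` returns 0 in roughly three out of four calls.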

For lossy source compression, the SC operation is employed 
at the encoder side to map the source vector to a codeword. 
Therefore, from now onwards we refer to this operation as SC 
encoding. 



III. Main Result 



A. Statement 






Theorem 1 (Polar Codes Achieve the Rate-Distortion Bound):
Let $Y$ be a BSS and fix the design distortion $D$, $0 < D < \frac{1}{2}$.
For any rate $R > 1 - h_2(D)$ and any $0 < \beta < \frac{1}{2}$, there exists
a sequence of polar codes of length $N$ with rates $R_N < R$
so that under SC encoding using randomized rounding they
achieve expected distortion $D_N$ satisfying

$$D_N \le D + O(2^{-(N^{\beta})}).$$

The encoding as well as decoding complexity of these codes
is $\Theta(N \log(N))$.

B. Simulation Results and Discussion 

Let us consider how polar codes behave in practice. Recall
that the length $N$ of the code is always a power of 2, i.e.,
$N = 2^n$. Let us construct a polar code to achieve a distortion
$D$. Let $W$ denote the channel BSC($D$) and let $R = R(D) + \delta$
for some $\delta > 0$.

In order to fully specify the code we need to specify the set
$F$, i.e., the set of frozen components. We proceed as follows.
First we estimate the $Z^{(i)}$s for all $i \in \{0, \dots, N - 1\}$ and
sort the indices $i$ in decreasing order of the $Z^{(i)}$s. Since the
rate of the code is $1 - |F|/N$, the set $F$ consists of the
first $(1 - R)N$ indices, i.e., it consists of the indices corresponding
to the $(1 - R)N$ largest $Z^{(i)}$s.

This is similar to the channel code construction for the
BSC($D$) but there is a slight difference. For the case of channel
coding we assign all indices $i$ so that $Z^{(i)}$ is very small, i.e.,
so that, let us say, $Z^{(i)} < \delta$, to the set $F^c$. Therefore, the set $F$
consists of all those indices $i$ so that $Z^{(i)} \ge \delta$.

For source compression, on the other hand, $F$ consists
of all those indices $i$ so that $Z^{(i)} \ge 1 - \delta$, i.e., of all those
indices corresponding to very large values of $Z^{(i)}$.
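The selection rule for the source code can be sketched as follows. The sketch assumes that estimates of the $Z^{(i)}$ are already available in a list `z` (how they are estimated is outside its scope), that the rate of the code is $1 - |F|/N$ so the frozen set holds the $(1-R)N$ largest values, and that ties are broken by index order; `frozen_set_for_source_code` is a hypothetical name.

```python
def frozen_set_for_source_code(z, rate):
    """Indices of the (1 - rate) * N largest Bhattacharyya parameters."""
    n_total = len(z)
    n_frozen = n_total - int(round(rate * n_total))  # |F| = (1 - R) N
    # sort indices by decreasing Z; sorted() is stable, so ties keep index order
    order = sorted(range(n_total), key=lambda i: z[i], reverse=True)
    return set(order[:n_frozen])
```

For example, with `z = [0.9, 0.1, 0.99, 0.5]` and `rate = 0.5` the frozen set is `{0, 2}`, the two positions whose channels are closest to pure noise.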

Putting it differently, in channel coding, the rate $R$ is
chosen to be strictly less than $1 - h_2(D)$, whereas in source
compression it is chosen so that it is strictly larger than this
quantity. Figure 3 shows the performance of the SC encoding
algorithm combined with randomized rounding. As asserted
by Theorem 1 the points approach the rate-distortion bound
as the block length increases.

Fig. 3. The rate-distortion performance for the SC encoding algorithm with
randomized rounding for $n = 9, 11, 13, 15, 17$ and $19$. As the block length
increases the points move closer to the rate-distortion bound.

In [20] the performance of polar codes for lossy source
compression was already investigated empirically. Note that
the construction used in [20] is different from the current
construction. Let us recall. Consider a BSC($p$), where
$p = h_2^{-1}(1 - h_2(D))$. Let the corresponding Bhattacharyya
constants be the $\tilde Z^{(i)}$s. In [20] first a channel code of rate
$1 - h_2(p) - \epsilon$ is constructed according to the values $\tilde Z^{(i)}$.
Let $\tilde F$ be the corresponding frozen set. The set $F$ for the
source code is given by

$$F = \{N - 1 - i : i \in \tilde F^c\}.$$

The rationale behind this construction is that the resulting
source code is the dual of the channel code designed for
the BSC($p$). The rate of the resulting source code is equal to
$h_2(p) + \epsilon = 1 - h_2(D) + \epsilon$. Although this code construction is
different, empirically the resulting frozen sets are very similar.

There is also a slight difference with respect to the decima- 
tion algorithm. In [20] the decimation step is based on MAP 
estimates, whereas in the current setting we use randomized 
rounding. 

Despite all these differences the performance of both 
schemes is comparable. 

IV. The Proof 
From now on we restrict $W$ to be a BSC($D$), i.e.,

$$W(0 \mid 1) = W(1 \mid 0) = D, \qquad W(0 \mid 0) = W(1 \mid 1) = 1 - D.$$

As an immediate consequence we have

$$W(y \mid x) = W(y \oplus z \mid x \oplus z). \qquad (7)$$

This extends in a natural way if we consider vectors.

A. The Standard Source Coding Model 

Let us describe lossy source compression using polar codes 
in more detail. We refer to this as the "Standard Model." In 
the following we assume that we want to compress the source 
with average distortion D. 

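Model: Let $\bar y = (y_0, \dots, y_{N-1})$ denote $N$ i.i.d. realizations
of the source $Y$. Let $F \subseteq \{0, \dots, N - 1\}$ and let
$\bar u_F \in \{0,1\}^{|F|}$ be a fixed vector. In the sequel we use the
shorthand "SM($F, \bar u_F$)" to denote the Standard Model with frozen
set $F$ whose components are fixed to $\bar u_F$. It is defined as follows.

Encoding: Let $f^{\bar u_F} : \{0,1\}^N \to \{0,1\}^{|F^c|}$ denote the
encoding function. For a given $\bar y$ we first compute $\bar u$, as
described below, where $\bar u = (u_0, \dots, u_{N-1})$. Then
$f^{\bar u_F}(\bar y) = \bar u_{F^c}$.

Given $\bar y$, for each $i$ in the range $0$ till $N - 1$:

(i) Compute

$$l_i(\bar y, u_0^{i-1}) = \frac{W^{(i)}(\bar y, u_0^{i-1} \mid u_i = 0)}{W^{(i)}(\bar y, u_0^{i-1} \mid u_i = 1)}.$$

(ii) If $i \in F^c$ then set $u_i = 0$ with probability $\frac{l_i}{1 + l_i}$
and equal to $1$ otherwise; if $i \in F$ then set $u_i$ to the
corresponding component of the fixed vector $\bar u_F$.

Decoding: The decoding function
$\hat f^{\bar u_F} : \{0,1\}^{|F^c|} \to \{0,1\}^N$ maps $\bar u_{F^c}$ back to
the reconstruction point $\bar x$ via $\bar x = \bar u H_n$, where the
components of $\bar u$ on $F$ are set to $\bar u_F$.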

Distortion: The average distortion incurred by this scheme
is given by $\frac{1}{N} E[d(\bar Y, \bar X)]$, where
$d(\bar y, \bar x) = \sum_{i} d(y_i, x_i)$ and the expectation is over the
source randomness and the randomness involved in the
randomized rounding at the encoder.

Complexity: The encoding (decoding) task for source coding
is the same as the decoding (encoding) task for channel coding.
As remarked before, both have complexity $\Theta(N \log N)$.

Remark: Recall that $\frac{l_i}{1 + l_i}$ is the posterior probability that
the variable $U_i$ equals $0$ given the observations $\bar Y$ as well as
$U_0^{i-1}$, under the assumption that $\bar U$ has uniform prior and
that $\bar Y$ is the result of transmitting $\bar U H_n$ over a BSC($D$).
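The Standard Model can be simulated for very small block lengths by computing every posterior through exhaustive summation. The sketch below does exactly that. It is a brute-force illustration with cost exponential in $N$, not the $\Theta(N \log N)$ factor-graph implementation of [15]; it assumes the bit-reversal permutation acts on the rows of the Kronecker power and that all frozen bits are set to 0, and every function name in it is hypothetical.

```python
import itertools
import random


def polar_matrix(n):
    """Bit-reversed rows of the n-th Kronecker power of [[1, 0], [1, 1]]."""
    g = [[1]]
    for _ in range(n):
        s = len(g)
        g = [row + [0] * s for row in g] + [row + row for row in g]

    def rev(i):
        return int(format(i, "0{}b".format(n))[::-1], 2)

    return [g[rev(i)] for i in range(1 << n)]


def encode(u, h):
    """x = u H over GF(2)."""
    size = len(u)
    return [sum(u[i] & h[i][j] for i in range(size)) % 2 for j in range(size)]


def likelihood(y, u, h, d):
    """P(y | u) for a BSC(d): product over i of W(y_i | (uH)_i)."""
    flips = sum(a != b for a, b in zip(encode(u, h), y))
    return d ** flips * (1.0 - d) ** (len(y) - flips)


def posterior_zero(y, prefix, h, d):
    """P(U_i = 0 | u_0^{i-1}, y), summing over all suffixes u_{i+1}^{N-1}."""
    tail = len(y) - len(prefix) - 1
    mass = [0.0, 0.0]
    for bit in (0, 1):
        for suffix in itertools.product((0, 1), repeat=tail):
            mass[bit] += likelihood(y, list(prefix) + [bit] + list(suffix), h, d)
    return mass[0] / (mass[0] + mass[1])  # equals l_i / (1 + l_i)


def sc_encode(y, frozen, h, d, rng):
    """SC encoding with randomized rounding; frozen positions are set to 0."""
    u = []
    for i in range(len(y)):
        if i in frozen:
            u.append(0)
        else:
            p0 = posterior_zero(y, u, h, d)
            u.append(0 if rng.random() < p0 else 1)
    return u
```

The compressed description is the list of entries of `u` outside `frozen`, and the reconstruction is `encode(u, h)`. With an empty frozen set the per-symbol Hamming distortion averages to about `d`, in line with the observation made later in the proof that using all variables yields distortion $D$ at rate 1.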



B. Computation of Average Distortion 

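The encoding function $f^{\bar u_F}$ is random. More precisely, in
step $i$ of the encoding process, $i \in F^c$, we fix the value
of $U_i$ proportional to the posterior (randomized rounding)
$P_{U_i \mid U_0^{i-1}, \bar Y}(u_i \mid u_0^{i-1}, \bar y)$. This implies that
the probability of picking a vector $\bar u$ given $\bar y$ is equal to

$$\begin{cases} 0, & \text{if the components of } \bar u \text{ on } F \text{ differ from the fixed values } \bar u_F, \\ \prod_{i \in F^c} P(u_i \mid u_0^{i-1}, \bar y), & \text{if they agree.} \end{cases}$$

Therefore, the average (over $\bar y$ and the randomness of the
encoder) distortion of SM($F, \bar u_F$) is given by

$$D_N(F, \bar u_F) = \sum_{\bar y \in \{0,1\}^N} \frac{1}{2^N} \sum_{\bar u_{F^c} \in \{0,1\}^{|F^c|}} \Big( \prod_{i \in F^c} P(u_i \mid u_0^{i-1}, \bar y) \Big) \frac{1}{N} d(\bar y, \bar u H_n), \qquad (8)$$

where for $i \in F$ the component $u_i$ is set to the corresponding
fixed value in $\bar u_F$.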

We want to show that there exists a set $F$ of cardinality
roughly $N h_2(D)$ and a vector $\bar u_F$ such that
$D_N(F, \bar u_F) \approx D$. This will show that polar codes achieve
the rate-distortion bound.

For the proof it is more convenient not to determine the
distortion for a fixed choice of $\bar u_F$ but to compute the average
distortion over all possible choices (with a uniform distribution
over these choices). Later, in Section V, we will see that the
distortion does not depend on the choice of $\bar u_F$. A convenient
choice is therefore to set it to zero. This will lead to the desired
final result.

Let us therefore start by computing the average distortion.
Let $D_N(F)$ denote the distortion obtained by averaging
$D_N(F, \bar u_F)$ over all $2^{|F|}$ possible values of $\bar u_F$. We will
show that $D_N(F)$ is close to $D$.

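The distortion $D_N(F)$ can be written as

$$\begin{aligned}
D_N(F) &= \sum_{\bar u_F \in \{0,1\}^{|F|}} \frac{1}{2^{|F|}} D_N(F, \bar u_F) \\
&= \frac{1}{2^{|F|}} \sum_{\bar u_F} \sum_{\bar y} \frac{1}{2^N} \sum_{\bar u_{F^c}} \Big( \prod_{i \in F^c} P(u_i \mid u_0^{i-1}, \bar y) \Big) \frac{1}{N} d(\bar y, \bar u H_n) \\
&= \sum_{\bar y} \frac{1}{2^N} \sum_{\bar u} \frac{1}{2^{|F|}} \Big( \prod_{i \in F^c} P(u_i \mid u_0^{i-1}, \bar y) \Big) \frac{1}{N} d(\bar y, \bar u H_n).
\end{aligned}$$

Let $Q_{\bar U, \bar Y}$ denote the distribution defined by
$Q_{\bar Y}(\bar y) = \frac{1}{2^N}$ and $Q_{\bar U \mid \bar Y}$ defined by

$$Q(u_i \mid u_0^{i-1}, \bar y) = \begin{cases} \frac{1}{2}, & \text{if } i \in F, \\ P_{U_i \mid U_0^{i-1}, \bar Y}(u_i \mid u_0^{i-1}, \bar y), & \text{if } i \in F^c. \end{cases} \qquad (9)$$

Then,

$$D_N(F) = \frac{1}{N} E_Q[d(\bar Y, \bar U H_n)],$$

where $E_Q[\cdot]$ denotes expectation with respect to the distribution
$Q_{\bar U, \bar Y}$.

Similarly, let $E_P[\cdot]$ denote the expectation with respect to
the distribution $P_{\bar U, \bar Y}$. Recall that $P_{\bar Y}(\bar y) = \frac{1}{2^N}$
and that we can write $P_{\bar U \mid \bar Y}$ in the form

$$P_{\bar U \mid \bar Y}(\bar u \mid \bar y) = \prod_{i=0}^{N-1} P_{U_i \mid U_0^{i-1}, \bar Y}(u_i \mid u_0^{i-1}, \bar y).$$
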

If we compare $Q$ to $P$ we see that they have the same structure
except for the components $i \in F$. Indeed, in the following
lemma we show that the total variation distance between $Q$
and $P$ can be bounded in terms of how much the posteriors
$Q_{U_i \mid U_0^{i-1}, \bar Y}$ and $P_{U_i \mid U_0^{i-1}, \bar Y}$ differ for $i \in F$.

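Lemma 2 (Bound on the Total Variation Distance): Let $F$
denote the set of frozen indices and let the probability
distributions $Q$ and $P$ be as defined above. Then

$$\sum_{\bar u, \bar y} |Q(\bar u, \bar y) - P(\bar u, \bar y)| \le 2 \sum_{i \in F} E_P\Big[ \Big| \frac{1}{2} - P_{U_i \mid U_0^{i-1}, \bar Y}(0 \mid U_0^{i-1}, \bar Y) \Big| \Big].$$

Proof:

$$\begin{aligned}
\sum_{\bar u} |Q(\bar u \mid \bar y) - P(\bar u \mid \bar y)|
&= \sum_{\bar u} \Big| \prod_{i=0}^{N-1} Q(u_i \mid u_0^{i-1}, \bar y) - \prod_{i=0}^{N-1} P(u_i \mid u_0^{i-1}, \bar y) \Big| \\
&= \sum_{\bar u} \Big| \sum_{i=0}^{N-1} \big( Q(u_i \mid u_0^{i-1}, \bar y) - P(u_i \mid u_0^{i-1}, \bar y) \big) \Big( \prod_{j=0}^{i-1} P(u_j \mid u_0^{j-1}, \bar y) \Big) \Big( \prod_{j=i+1}^{N-1} Q(u_j \mid u_0^{j-1}, \bar y) \Big) \Big|.
\end{aligned}$$

In the last step we have used the following telescoping
expansion:

$$A_0^{N-1} - B_0^{N-1} = \sum_{i=0}^{N-1} A_0^{i} B_{i+1}^{N-1} - \sum_{i=0}^{N-1} A_0^{i-1} B_i^{N-1},$$

where $A_k^j$ denotes here the product $\prod_{i=k}^{j} A_i$ (an empty
product equals 1).

Now note that if $i \in F^c$ then $Q(u_i \mid u_0^{i-1}, \bar y) =
P(u_i \mid u_0^{i-1}, \bar y)$, so that these terms vanish. The above sum
therefore reduces to

$$\begin{aligned}
&\sum_{\bar u} \Big| \sum_{i \in F} \big( Q(u_i \mid u_0^{i-1}, \bar y) - P(u_i \mid u_0^{i-1}, \bar y) \big) \Big( \prod_{j=0}^{i-1} P(u_j \mid u_0^{j-1}, \bar y) \Big) \Big( \prod_{j=i+1}^{N-1} Q(u_j \mid u_0^{j-1}, \bar y) \Big) \Big| \\
&\quad \le \sum_{i \in F} \sum_{\bar u} \Big| \frac{1}{2} - P(u_i \mid u_0^{i-1}, \bar y) \Big| \Big( \prod_{j=0}^{i-1} P(u_j \mid u_0^{j-1}, \bar y) \Big) \Big( \prod_{j=i+1}^{N-1} Q(u_j \mid u_0^{j-1}, \bar y) \Big) \\
&\quad = \sum_{i \in F} \sum_{u_0^{i}} \Big| \frac{1}{2} - P(u_i \mid u_0^{i-1}, \bar y) \Big| \prod_{j=0}^{i-1} P(u_j \mid u_0^{j-1}, \bar y) \\
&\quad \le 2 \sum_{i \in F} E_{P_{U_0^{i-1} \mid \bar Y = \bar y}}\Big[ \Big| \frac{1}{2} - P_{U_i \mid U_0^{i-1}, \bar Y}(0 \mid U_0^{i-1}, \bar y) \Big| \Big].
\end{aligned}$$

In the last step the summation over $u_i$ gives rise to the factor 2,
whereas the summation over $u_0^{i-1}$ gives rise to the expectation.

Note that $Q_{\bar Y}(\bar y) = P_{\bar Y}(\bar y) = \frac{1}{2^N}$. The claim
follows by taking the expectation over $\bar Y$. ■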
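For small block lengths both sides of the bound in Lemma 2 can be computed exactly by enumeration. The following sketch does so for a BSC($d$); it assumes the bit-reversal permutation acts on the rows of the Kronecker power, uses the fact that under a uniform prior $P(\bar u \mid \bar y)$ equals the channel likelihood, and all names in it are hypothetical.

```python
import itertools


def polar_matrix(n):
    """Bit-reversed rows of the n-th Kronecker power of [[1, 0], [1, 1]]."""
    g = [[1]]
    for _ in range(n):
        s = len(g)
        g = [row + [0] * s for row in g] + [row + row for row in g]

    def rev(i):
        return int(format(i, "0{}b".format(n))[::-1], 2)

    return [g[rev(i)] for i in range(1 << n)]


def likelihood(y, u, h, d):
    """P(y | u) for a BSC(d) with x = u H over GF(2)."""
    size = len(u)
    x = [sum(u[i] & h[i][j] for i in range(size)) % 2 for j in range(size)]
    flips = sum(a != b for a, b in zip(x, y))
    return d ** flips * (1.0 - d) ** (size - flips)


def tv_and_bound(n, d, frozen):
    """Return (sum |Q - P|, 2 * sum_{i in F} E_P|1/2 - P(0 | U_0^{i-1}, Y)|)."""
    h = polar_matrix(n)
    size = 1 << n
    words = list(itertools.product((0, 1), repeat=size))
    p_y = 1.0 / len(words)  # the source word is uniform
    tv = 0.0
    bound = 0.0
    for y in words:
        post = {u: likelihood(y, u, h, d) for u in words}  # P(u | y)
        prefix_mass = {}  # P(u_0^{i-1} | y) for every prefix
        for u, p in post.items():
            for i in range(size + 1):
                prefix_mass[u[:i]] = prefix_mass.get(u[:i], 0.0) + p
        for u in words:
            q = 1.0
            for i in range(size):
                cond = prefix_mass[u[:i + 1]] / prefix_mass[u[:i]]
                q *= 0.5 if i in frozen else cond  # definition (9)
            tv += p_y * abs(q - post[u])
        for i in frozen:
            for prefix in itertools.product((0, 1), repeat=i):
                p0 = prefix_mass[prefix + (0,)] / prefix_mass[prefix]
                bound += 2.0 * p_y * prefix_mass[prefix] * abs(0.5 - p0)
    return tv, bound
```

Calling `tv_and_bound(2, 0.11, {0})` returns a total variation sum that does not exceed the bound, and with an empty frozen set both quantities are zero because $Q$ and $P$ then coincide.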

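Lemma 3 (Distortion under $Q$ versus Distortion under $P$):
Let $F$ be chosen such that for $i \in F$

$$E_P\Big[ \Big| \frac{1}{2} - P_{U_i \mid U_0^{i-1}, \bar Y}(0 \mid U_0^{i-1}, \bar Y) \Big| \Big] \le \delta_N. \qquad (10)$$

The average distortion is then bounded by

$$\frac{1}{N} E_Q[d(\bar Y, \bar U H_n)] \le \frac{1}{N} E_P[d(\bar Y, \bar U H_n)] + |F|\, 2 \delta_N.$$

Proof: Since $d(\bar y, \bar x) \le N$,

$$\begin{aligned}
E_Q[d(\bar Y, \bar U H_n)] - E_P[d(\bar Y, \bar U H_n)]
&\le N \sum_{\bar u, \bar y} |Q(\bar u, \bar y) - P(\bar u, \bar y)| \\
&\overset{\text{Lem. 2}}{\le} 2N \sum_{i \in F} E_P\Big[ \Big| \frac{1}{2} - P_{U_i \mid U_0^{i-1}, \bar Y}(0 \mid U_0^{i-1}, \bar Y) \Big| \Big] \\
&\le |F|\, 2 N \delta_N.
\end{aligned}$$

■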



From Lemma 3 we see that the average (over $\bar y$ as well as
$\bar u_F$) distortion of the Standard Model is upper bounded by the
average distortion with respect to $P$ plus a term which bounds
the "distance" between $Q$ and $P$.

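Lemma 4 (Distortion under $P$):

$$E_P[d(\bar Y, \bar U H_n)] = N D.$$

Proof: Let $\bar X = \bar U H_n$ and write

$$\begin{aligned}
E_P[d(\bar Y, \bar U H_n)]
&= \sum_{\bar y, \bar u, \bar x} P_{\bar U, \bar X, \bar Y}(\bar u, \bar x, \bar y)\, d(\bar y, \bar u H_n) \\
&= \sum_{\bar y, \bar u, \bar x} P_{\bar X, \bar Y}(\bar x, \bar y)\, P_{\bar U \mid \bar X, \bar Y}(\bar u \mid \bar x, \bar y)\, d(\bar y, \bar x) \\
&= \sum_{\bar y, \bar x} P_{\bar X, \bar Y}(\bar x, \bar y)\, d(\bar y, \bar x),
\end{aligned}$$

where the factor $P_{\bar U \mid \bar X, \bar Y}(\bar u \mid \bar x, \bar y)$ is
$\{0,1\}$-valued: it equals 1 exactly when $\bar x = \bar u H_n$.

Note that the unconditional distribution of $\bar X$ as well as $\bar Y$ is
the uniform one and that the channel between $\bar X$ and $\bar Y$ is
memoryless and identical for each component. Therefore, we
can write this expectation as

$$\begin{aligned}
E_P[d(\bar Y, \bar U H_n)]
&= N \sum_{x_0, y_0} P_{X_0, Y_0}(x_0, y_0)\, d(y_0, x_0) \\
&\overset{(a)}{=} N \sum_{x_0} P_{X_0}(x_0) \sum_{y_0} W(y_0 \mid x_0)\, d(y_0, x_0) \\
&\overset{(b)}{=} N\, W(0 \mid 1) = N D.
\end{aligned}$$

In the above equation, (a) follows from the fact that
$P_{Y \mid X}(y \mid x) = W(y \mid x)$, and (b) follows from our assumption
that $W$ is a BSC($D$). ■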

This implies that if we use all the variables $\{U_i\}$ to represent
the source word, i.e., $F$ is empty, then the algorithm results in
an average distortion $D$. But the rate of such a code would be
1. Fortunately, the last problem is easily fixed. If we choose $F$
to consist of those variables which are "essentially random,"
then there is only a small distortion penalty (namely, $|F|\, 2\delta_N$)
to pay with respect to the previous case. But the rate has been
decreased to $1 - |F|/N$.

Lemma 3 shows that the guiding principle for choosing the
set $F$ is to include the indices with small $\delta_N$ in (10). In the
following lemma, we find a sufficient condition for an index
to satisfy (10), which is easier to handle.

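Lemma 5 ($Z^{(i)}$ Close to 1 is Good): If $Z^{(i)} \ge 1 - 2\delta_N^2$,
then

$$E_P\Big[ \Big| \frac{1}{2} - P_{U_i \mid U_0^{i-1}, \bar Y}(0 \mid U_0^{i-1}, \bar Y) \Big| \Big] \le \delta_N.$$

Proof:

$$\begin{aligned}
Z^{(i)} &= \sum_{\bar y,\, u_0^{i-1}} \sqrt{W^{(i)}(\bar y, u_0^{i-1} \mid 0)\, W^{(i)}(\bar y, u_0^{i-1} \mid 1)} \\
&\overset{(a)}{=} 2 \sum_{\bar y,\, u_0^{i-1}} \sqrt{P(u_0^{i-1}, 0, \bar y)\, P(u_0^{i-1}, 1, \bar y)} \\
&= 2 \sum_{\bar y,\, u_0^{i-1}} P(u_0^{i-1}, \bar y) \sqrt{P(0 \mid u_0^{i-1}, \bar y)\, P(1 \mid u_0^{i-1}, \bar y)} \\
&= 2\, E_P\Big[ \sqrt{P_{U_i \mid U_0^{i-1}, \bar Y}(0 \mid U_0^{i-1}, \bar Y)\, P_{U_i \mid U_0^{i-1}, \bar Y}(1 \mid U_0^{i-1}, \bar Y)} \Big].
\end{aligned}$$

The equality (a) follows from the fact that $P_{\bar U}(\bar u) = \frac{1}{2^N}$ for
all $\bar u \in \{0,1\}^N$.

Assume now that $Z^{(i)} \ge 1 - 2\delta_N^2$. Then

$$E_P\Big[ \frac{1}{2} - \sqrt{P_{U_i \mid U_0^{i-1}, \bar Y}(0 \mid U_0^{i-1}, \bar Y)\, P_{U_i \mid U_0^{i-1}, \bar Y}(1 \mid U_0^{i-1}, \bar Y)} \Big] \le \delta_N^2.$$

Multiplying and dividing the term inside the expectation with

$$\frac{1}{2} + \sqrt{P_{U_i \mid U_0^{i-1}, \bar Y}(0 \mid U_0^{i-1}, \bar Y)\, P_{U_i \mid U_0^{i-1}, \bar Y}(1 \mid U_0^{i-1}, \bar Y)}$$

and upper bounding this term in the denominator with 1, we
get

$$E_P\Big[ \frac{1}{4} - P_{U_i \mid U_0^{i-1}, \bar Y}(0 \mid U_0^{i-1}, \bar Y)\, P_{U_i \mid U_0^{i-1}, \bar Y}(1 \mid U_0^{i-1}, \bar Y) \Big] \le \delta_N^2.$$

Now, using the equality $\frac{1}{4} - p \bar p = (\frac{1}{2} - p)^2$, where
$\bar p = 1 - p$, we get

$$E_P\Big[ \Big( \frac{1}{2} - P_{U_i \mid U_0^{i-1}, \bar Y}(0 \mid U_0^{i-1}, \bar Y) \Big)^2 \Big] \le \delta_N^2.$$

The result now follows by applying the Cauchy-Schwarz
inequality. ■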
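The proof of Lemma 5 rests on two elementary facts about a probability $p \in [0,1]$: the identity $\frac{1}{4} - p(1-p) = (\frac{1}{2} - p)^2$, and the inequality $(\frac{1}{2} - p)^2 \le \frac{1}{2} - \sqrt{p(1-p)}$ obtained by bounding the factor $\frac{1}{2} + \sqrt{p(1-p)}$ by 1. The short sketch below checks both on a grid; the function names and the grid resolution are arbitrary illustrative choices.

```python
import math


def gap(p):
    """1/2 - sqrt(p (1 - p)); its mean under P equals (1 - Z) / 2."""
    return 0.5 - math.sqrt(p * (1.0 - p))


def squared_bias(p):
    """(1/2 - p)^2, the quantity bounded in the last step of the proof."""
    return (0.5 - p) ** 2


def check_lemma5_inequalities(steps=1000, tol=1e-12):
    """Verify the identity and the inequality on a uniform grid over [0, 1]."""
    for k in range(steps + 1):
        p = k / steps
        if abs((0.25 - p * (1.0 - p)) - squared_bias(p)) > tol:
            return False
        if squared_bias(p) > gap(p) + tol:
            return False
    return True
```

Taking expectations of the inequality shows that a mean gap of at most $\delta_N^2$ forces the mean squared bias below $\delta_N^2$, which is the step preceding the Cauchy-Schwarz argument.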

We are now ready to prove Theorem 1. In order to show
that there exists a polar code which achieves the rate-distortion
tradeoff, we show that the size of the set $F$ can be made
arbitrarily close to $N h_2(D)$ while keeping the penalty term
$|F|\, 2\delta_N$ arbitrarily small.

Proof of Theorem 1:

Let $\beta < \frac{1}{2}$ be a constant and let
$\delta_N = \frac{1}{2N} 2^{-N^{\beta}}$. Consider a polar code with frozen
set $F_N$,

$$F_N = \{ i \in \{0, \dots, N - 1\} : Z^{(i)} \ge 1 - 2\delta_N^2 \}.$$

For $N$ sufficiently large there exists a $\beta' < \frac{1}{2}$ such that
$2\delta_N^2 > 2^{-N^{\beta'}}$. Theorem 16 and equation (19) imply that

$$\lim_{N = 2^n,\, n \to \infty} \frac{|F_N|}{N} = h_2(D). \qquad (11)$$

For any $\epsilon > 0$ this implies that for $N$ sufficiently large there
exists a set $F_N$ such that

$$\frac{|F_N|}{N} \ge h_2(D) - \epsilon.$$

In other words

$$R_N = 1 - \frac{|F_N|}{N} \le R(D) + \epsilon.$$

Finally, from Lemma 3 we know that

$$D_N(F_N) \le D + 2 |F_N| \delta_N \le D + O(2^{-(N^{\beta})}) \qquad (12)$$

for any $0 < \beta < \frac{1}{2}$.

Recall that $D_N(F_N)$ is the average of the distortion over
all choices of $\bar u_{F_N}$. Since the average distortion fulfills (12) it
follows that there must be at least one choice of $\bar u_{F_N}$ for which

$$D_N(F_N, \bar u_{F_N}) \le D + O(2^{-(N^{\beta})})$$

for any $0 < \beta < \frac{1}{2}$.

The complexity of the encoding and decoding algorithms
is of the order $\Theta(N \log(N))$ as shown in [15]. ■

V. Value of Frozen Bits Does Not Matter 

In the previous sections we have considered $D_N(F)$, the
average distortion if we average over all choices of $\bar u_F$. We
will now show a stronger result, namely we will show that all
choices for $\bar u_F$ lead to the same distortion, i.e., $D_N(F, \bar u_F)$
is independent of $\bar u_F$. This implies that the components
belonging to the frozen set $F$ can be set to any value. A
convenient choice is to set them to 0. In the following let $F$
be a fixed set. The results here do not depend on the set
$F$.



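Lemma 6 (Gauge Transformation): Consider the Standard
Model introduced in the previous section. Let
$\bar y, \bar y' \in \{0,1\}^N$ and let
$u_0^{i-1} = u_0'^{\,i-1} \oplus ((\bar y \oplus \bar y') H_n^{-1})_0^{i-1}$. Then

$$l_i(\bar y, u_0^{i-1}) = \begin{cases} l_i(\bar y', u_0'^{\,i-1}), & \text{if } ((\bar y \oplus \bar y') H_n^{-1})_i = 0, \\ 1 / l_i(\bar y', u_0'^{\,i-1}), & \text{if } ((\bar y \oplus \bar y') H_n^{-1})_i = 1. \end{cases}$$

Proof:

$$\begin{aligned}
l_i(\bar y, u_0^{i-1})
&= \frac{W^{(i)}(\bar y, u_0^{i-1} \mid 0)}{W^{(i)}(\bar y, u_0^{i-1} \mid 1)}
= \frac{\sum_{u_{i+1}^{N-1}} P(\bar y \mid u_0^{i-1}, 0, u_{i+1}^{N-1})}{\sum_{u_{i+1}^{N-1}} P(\bar y \mid u_0^{i-1}, 1, u_{i+1}^{N-1})} \\
&\overset{(7)}{=} \frac{\sum_{u_{i+1}^{N-1}} P(\bar y' \mid (u_0^{i-1}, 0, u_{i+1}^{N-1}) \oplus (\bar y \oplus \bar y') H_n^{-1})}{\sum_{u_{i+1}^{N-1}} P(\bar y' \mid (u_0^{i-1}, 1, u_{i+1}^{N-1}) \oplus (\bar y \oplus \bar y') H_n^{-1})} \\
&= \frac{\sum_{u_{i+1}^{N-1}} P(\bar y' \mid u_0'^{\,i-1},\, 0 \oplus ((\bar y \oplus \bar y') H_n^{-1})_i,\, u_{i+1}^{N-1})}{\sum_{u_{i+1}^{N-1}} P(\bar y' \mid u_0'^{\,i-1},\, 1 \oplus ((\bar y \oplus \bar y') H_n^{-1})_i,\, u_{i+1}^{N-1})} \\
&= \frac{W^{(i)}(\bar y', u_0'^{\,i-1} \mid 0 \oplus ((\bar y \oplus \bar y') H_n^{-1})_i)}{W^{(i)}(\bar y', u_0'^{\,i-1} \mid 1 \oplus ((\bar y \oplus \bar y') H_n^{-1})_i)}.
\end{aligned}$$

The claim follows by considering the two possible values of
$((\bar y \oplus \bar y') H_n^{-1})_i$. ■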
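The gauge symmetry of Lemma 6 can be checked numerically for small block lengths by computing the likelihood ratios through exhaustive summation. The sketch below assumes a BSC($d$) and a bit-reversal permutation acting on the rows of the Kronecker power; under that convention the transform matrix is its own inverse over GF(2), which the code verifies before relying on it. All function names are hypothetical.

```python
import itertools
import random


def polar_matrix(n):
    """Bit-reversed rows of the n-th Kronecker power of [[1, 0], [1, 1]]."""
    g = [[1]]
    for _ in range(n):
        s = len(g)
        g = [row + [0] * s for row in g] + [row + row for row in g]

    def rev(i):
        return int(format(i, "0{}b".format(n))[::-1], 2)

    return [g[rev(i)] for i in range(1 << n)]


def vec_mat(v, h):
    """v H over GF(2)."""
    size = len(v)
    return [sum(v[i] & h[i][j] for i in range(size)) % 2 for j in range(size)]


def ratio(y, prefix, h, d):
    """Likelihood ratio l_i(y, u_0^{i-1}) with i = len(prefix), by enumeration."""
    size = len(y)
    mass = [0.0, 0.0]
    for bit in (0, 1):
        for suffix in itertools.product((0, 1), repeat=size - len(prefix) - 1):
            x = vec_mat(list(prefix) + [bit] + list(suffix), h)
            flips = sum(a != b for a, b in zip(x, y))
            mass[bit] += d ** flips * (1.0 - d) ** (size - flips)
    return mass[0] / mass[1]


def check_gauge(n, d, trials, rng):
    """Test the relation of Lemma 6 on random pairs of source words."""
    h = polar_matrix(n)
    size = 1 << n
    identity = [[int(i == j) for j in range(size)] for i in range(size)]
    # H is its own inverse over GF(2); verified here rather than assumed.
    assert [vec_mat(row, h) for row in h] == identity
    for _ in range(trials):
        y = [rng.randrange(2) for _ in range(size)]
        y2 = [rng.randrange(2) for _ in range(size)]
        shift = vec_mat([a ^ b for a, b in zip(y, y2)], h)  # (y + y') H^{-1}
        i = rng.randrange(size)
        prefix2 = [rng.randrange(2) for _ in range(i)]
        prefix = [a ^ b for a, b in zip(prefix2, shift)]  # zip stops at length i
        l1 = ratio(y, prefix, h, d)
        l2 = ratio(y2, prefix2, h, d)
        expected = l2 if shift[i] == 0 else 1.0 / l2
        if abs(l1 - expected) > 1e-9 * max(1.0, abs(expected)):
            return False
    return True
```

For instance, `check_gauge(2, 0.2, 50, random.Random(0))` exercises the relation on fifty random pairs of length-4 source words.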
Recall that the decision process involves randomized rounding
on the basis of $l_i$. Consider at first two tuples $(\bar y, u_0^{i-1})$ and
$(\bar y', u_0'^{\,i-1})$ so that their associated $l_i$ values are equal; we have
seen in the previous lemma that many such tuples exist. In
this case, if both tuples have access to the same source of
randomness, we can couple the two instances so that they
make the same decision on $U_i$. An equivalent statement is
true in the case when the two tuples have the same reliability
$|\log(l_i(\bar y, u_0^{i-1}))|$ but different signs. In this case there is a
simple coupling that ensures that if for the first tuple the
decision is, let us say, $U_i = 0$ then for the second tuple it is
$U_i = 1$ and vice versa. Hence, if in the sequel we compare
two instances of "compatible" tuples which have access to
the same source of randomness, then we assume exactly this
coupling.

Lemma 7 (Symmetry and Distortion): Consider the Standard
Model introduced in the previous section. Let
$\bar y, \bar y' \in \{0,1\}^N$, $F \subseteq \{0, \dots, N - 1\}$, and
$\bar u_F, \bar u'_F \in \{0,1\}^{|F|}$. If
$\bar u_F = \bar u'_F \oplus ((\bar y \oplus \bar y') H_n^{-1})_F$, then under the
coupling through a common source of randomness

$$f^{\bar u_F}(\bar y) = f^{\bar u'_F}(\bar y') \oplus ((\bar y \oplus \bar y') H_n^{-1})_{F^c}.$$

Proof: Let $\bar u, \bar u'$ be the two $N$-dimensional vectors
generated within the Standard Model. We use induction. Fix
$0 \le i \le N - 1$. We assume that for $j < i$,
$u_j = u'_j \oplus ((\bar y \oplus \bar y') H_n^{-1})_j$. This is in particular
correct if $i = 0$, which serves as our anchor.

By Lemma 6 we conclude that under our coupling the
respective decisions are related as
$u_i = u'_i \oplus ((\bar y \oplus \bar y') H_n^{-1})_i$
if $i \in F^c$. On the other hand, if $i \in F$, then the claim is true
by assumption. ■

Let $\bar v \in \{0,1\}^{|F|}$ and let $A(\bar v) \subset \{0,1\}^N$ denote the
coset

$$A(\bar v) = \{ \bar y : (\bar y H_n^{-1})_F = \bar v \}.$$

The set of source words $\{0,1\}^N$ can be partitioned as

$$\{0,1\}^N = \bigcup_{\bar v \in \{0,1\}^{|F|}} A(\bar v).$$

Note that all the cosets $A(\bar v)$ have equal size.

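The main result of this section is the following lemma. The
lemma implies that the distortion of SM($F, \bar u_F$) is independent
of $\bar u_F$.

Lemma 8 (Independence of Average Distortion w.r.t. $\bar u_F$):
Fix $F \subseteq \{0, \dots, N - 1\}$. The average distortion $D_N(F, \bar u_F)$
of the model SM($F, \bar u_F$) is independent of the choice of
$\bar u_F \in \{0,1\}^{|F|}$.

Proof: Let $\bar u_F, \bar u'_F \in \{0,1\}^{|F|}$ be two fixed vectors. We
will now show that $D_N(F, \bar u_F) = D_N(F, \bar u'_F)$. Let $\bar y, \bar y'$ be
two source words such that $\bar y \in A(\bar v)$ and
$\bar y' \in A(\bar v \oplus \bar u_F \oplus \bar u'_F)$,
i.e., $\bar u'_F = \bar u_F \oplus ((\bar y \oplus \bar y') H_n^{-1})_F$. Lemma 7
implies that

$$f^{\bar u'_F}(\bar y') = f^{\bar u_F}(\bar y) \oplus ((\bar y \oplus \bar y') H_n^{-1})_{F^c}.$$

This implies that the reconstruction words are related as

$$\hat f^{\bar u_F}(f^{\bar u_F}(\bar y)) = \hat f^{\bar u'_F}(f^{\bar u'_F}(\bar y')) \oplus ((\bar y \oplus \bar y') H_n^{-1}) H_n = \hat f^{\bar u'_F}(f^{\bar u'_F}(\bar y')) \oplus (\bar y \oplus \bar y').$$

Note that $\hat f^{\bar u_F}(f^{\bar u_F}(\bar y)) \oplus \bar y$ is the quantization
error. Therefore

$$d(\bar y, \hat f^{\bar u_F}(f^{\bar u_F}(\bar y))) = d(\bar y', \hat f^{\bar u'_F}(f^{\bar u'_F}(\bar y'))),$$

which further implies

$$\sum_{\bar y \in A(\bar v)} d(\bar y, \hat f^{\bar u_F}(f^{\bar u_F}(\bar y))) = \sum_{\bar y' \in A(\bar v \oplus \bar u_F \oplus \bar u'_F)} d(\bar y', \hat f^{\bar u'_F}(f^{\bar u'_F}(\bar y'))).$$

Hence, the average distortions satisfy

$$\begin{aligned}
\sum_{\bar y} \frac{1}{2^N} d(\bar y, \hat f^{\bar u_F}(f^{\bar u_F}(\bar y)))
&= \sum_{\bar v \in \{0,1\}^{|F|}} \frac{1}{2^N} \sum_{\bar y \in A(\bar v)} d(\bar y, \hat f^{\bar u_F}(f^{\bar u_F}(\bar y))) \\
&= \sum_{\bar v \in \{0,1\}^{|F|}} \frac{1}{2^N} \sum_{\bar y' \in A(\bar v \oplus \bar u_F \oplus \bar u'_F)} d(\bar y', \hat f^{\bar u'_F}(f^{\bar u'_F}(\bar y'))) \\
&= \sum_{\bar v \in \{0,1\}^{|F|}} \frac{1}{2^N} \sum_{\bar y' \in A(\bar v)} d(\bar y', \hat f^{\bar u'_F}(f^{\bar u'_F}(\bar y'))) \\
&= \sum_{\bar y} \frac{1}{2^N} d(\bar y, \hat f^{\bar u'_F}(f^{\bar u'_F}(\bar y))).
\end{aligned}$$

As mentioned before, the functions $f^{\bar u_F}$ and $\hat f^{\bar u_F}$ are not
deterministic and the above equality is valid under the
assumption of coupling with a common source of randomness.
Averaging over this common randomness, we get
$D_N(F, \bar u_F) = D_N(F, \bar u'_F)$. ■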

Let $Q^{u_F}$ denote the empirical distribution of the quantization noise, i.e.,

$$Q^{u_F}(x) = \mathbb{E}\big[\mathbb{1}_{\{Y \oplus \hat f^{u_F}(f^{u_F}(Y)) = x\}}\big],$$

where the expectation is over the randomness involved in the source and randomized rounding. Continuing with the reasoning of the previous lemma, we can indeed show that the distribution $Q^{u_F}$ is independent of $u_F$. Combining this with the earlier bound on the total variation distance, we can bound the distance between $Q^{u_F}$ and an i.i.d. Ber($D$) noise. This will be useful in settings which involve both channel and source coding, like the Wyner-Ziv problem, where it is necessary to show that the quantization noise is close to a Bernoulli random variable.

Lemma 9 (Distribution of the Quantization Error): Let the frozen set $F$ be

$$F = \{i : Z^{(i)} \ge 1 - 2\delta_N^2\}.$$

Then for $u_F$ fixed,

$$\sum_x \Big| Q^{u_F}(x) - \prod_i W(x_i \mid 0) \Big| \le 2|F|\delta_N.$$

Proof: Recall that $P_{X|Y}(x \mid y) = \prod_i W(x_i \mid y_i)$. Let $v \in \{0,1\}^{|F|}$ be a fixed vector. Consider a vector $y \in A(v)$ and set $y' = 0$. Lemma 7 implies that

$$f^{u_F}(y) = f^{u_F \oplus v}(0) \oplus (yH_n^{-1})_{F^c}.$$

Therefore,

$$y \oplus \hat f^{u_F}(f^{u_F}(y)) = 0 \oplus \hat f^{u_F \oplus v}(f^{u_F \oplus v}(0)).$$

This implies that all vectors belonging to $A(v)$ have the same quantization error and this error is equal to the error incurred by the all-zero word when the frozen bits are set to $u_F \oplus v$.

Moreover, the uniform distribution of the source induces a uniform distribution on the sets $A(v)$ where $v \in \{0,1\}^{|F|}$. Therefore, the distribution of the quantization error $Q^{u_F}$ is the same as first picking the coset uniformly at random, i.e., the bits $u_F$, and then generating the error $x$ according to $x = \hat f^{u_F}(f^{u_F}(0))$. The distribution of the vector $u$ where $u = xH_n^{-1}$ is indeed the distribution $Q$ defined earlier. Recall that in the distribution $P_{U,X,Y}$, $U$ and $X$ are related as $U = XH_n^{-1}$. Therefore, the distribution induced by $W(x \mid y)$ on $U$ is $P_{U|Y}$. Since multiplication with $H_n^{-1}$ is a one-to-one mapping, the total variation distance can be bounded as

$$\sum_x \Big| Q^{u_F}(x) - \prod_i W(x_i \mid 0) \Big| = \sum_u \big| Q(u \mid 0) - P_{U|Y}(u \mid 0) \big| \overset{(a)}{\le} 2|F|\delta_N.$$

The inequality (a) follows from Lemma 2 and Lemma 5. ■

VI. Beyond Source Coding 

Polar codes were originally defined in the context of channel 
coding in [15], where it was shown that they achieve the capac- 
ity of symmetric B-DMCs. Now we have seen that polar codes 
achieve the rate-distortion tradeoff for lossy compression of a 
BSS. The natural question to ask next is whether these codes 
are suitable for problems that involve both quantization as well 
as error correction. 

Perhaps the two most prominent examples are the source 
coding problem with side information (Wyner-Ziv problem 
[21]) as well as the channel coding problem with side in- 
formation (Gelfand-Pinsker problem [22]). As discussed in 
[23], nested linear codes are required to tackle these problems. 
Polar codes are equipped with such a nested structure and are, 
hence, natural candidates for these problems. We will show 
that, by taking advantage of this structure, one can construct 
polar codes that are optimal in both settings (for the binary 
versions of these problems). Hence, polar codes provide the 
first provably optimal low-complexity solution.

In [7] the authors constructed MN codes which have the
required nested structure. They show that these codes achieve 
the optimum performance under MAP decoding. How these 
codes perform under low complexity message-passing algo- 
rithms is still an open problem. Trellis and turbo based codes 
were considered in [24]-[27] for the Wyner-Ziv problem. It 
was empirically shown that they achieve good performance 
with low complexity message-passing algorithms. A similar 
combination was considered in [28]-[30] for the Gelfand- 
Pinsker problem. Again, empirical results close to the optimum 
performance were obtained. 

We end this section by applying polar codes to a multi- 
terminal setup. One such scenario was considered in [20], 
where it was shown that polar codes are optimal for lossless 
compression of a correlated binary source (the Slepian-Wolf 
problem [31]). The result follows by mapping the lossless 
source compression task to a channel coding problem. 

Here we consider another multi-terminal setup known as 
the one helper problem [32]. This problem involves channel 
coding at one terminal and source coding at the other. We again 
show that polar codes achieve optimal performance under low- 
complexity encoding and decoding algorithms. 

A. Binary Wyner-Ziv Problem 

Let $Y$ be a BSS and let the decoder have access to a random variable $Y'$. This random variable is usually called the side information. We assume that $Y'$ is correlated to $Y$ as $Y' = Y \oplus Z$, where $Z$ is a Ber($p$) random variable. The task of the encoder is to compress the source $Y$, call the result $X$, such that a decoder with access to $(Y', X)$ can reconstruct the source to within a distortion $D$.

Fig. 4. The side information $Y'$ is available at the decoder. The decoder wants to reconstruct the source $Y$ to within a distortion $D$ given $X$.

Wyner and Ziv [21] have shown that the rate-distortion curve for this problem is given by

$$\text{l.c.e.}\{(R_{\mathrm{wz}}(D), D), (0, p)\},$$

where $R_{\mathrm{wz}}(D) = h_2(D * p) - h_2(D)$, l.c.e. denotes the lower convex envelope, and $D * p = D(1 - p) + p(1 - D)$. Here we focus on achieving the rates of the form $R_{\mathrm{wz}}(D)$. The remaining rates can be achieved by appropriate time-sharing with the pair $(0, p)$.
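As a small numerical illustration of the formula $R_{\mathrm{wz}}(D) = h_2(D * p) - h_2(D)$, the following sketch evaluates it directly. The function names `h2`, `conv`, and `r_wz` are chosen here for illustration only, and the values $D = 0.1$, $p = 0.2$ are arbitrary.

```python
from math import log2

def h2(x):
    """Binary entropy function in bits, with h2(0) = h2(1) = 0."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)

def conv(D, p):
    """Binary convolution D * p = D(1 - p) + p(1 - D)."""
    return D * (1 - p) + p * (1 - D)

def r_wz(D, p):
    """Rate of the form R_wz(D) = h2(D * p) - h2(D)."""
    return h2(conv(D, p)) - h2(D)

print(conv(0.1, 0.2), r_wz(0.1, 0.2))
```

At $D = 0$ the expression reduces to $h_2(p)$, and at $D = 1/2$ it is zero, since $\frac12 * p = \frac12$.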

The proof is based on the following nested code construction. Let $C_s$ denote the polar code defined by the frozen set $F_s$ with the frozen bits $u_{F_s}$ set to 0. Let $C_c(v)$ denote the code defined by the frozen set $F_c \supseteq F_s$ with the frozen bits $u_{F_s}$ set to 0 and $u_{F_c \setminus F_s} = v$. This implies that the code $C_s$ can be partitioned as $C_s = \bigcup_v C_c(v)$.

The code $C_s$ is designed to be a good source code for distortion $D$ and for each $v$ the code $C_c(v)$ is designed to be a good channel code for the BSC($D * p$).

The encoder compresses the source vector $Y$ to a vector $U_{F_s^c}$ through the map $U_{F_s^c} = f^{0}(Y)$. The reconstruction vector $X$ is given by $X = \hat f^{0}(f^{0}(Y))$. Since the code $C_s$ is a good source code, the quantization error $Y \oplus X$ is close to a Ber($D$) vector (see Lemma 9). This implies that the vector $Y'$ which is available at the decoder is statistically equivalent to the output of a BSC($D * p$) when the input is $X$. The encoder transmits the vector $V = U_{F_c \setminus F_s}$ to the decoder. This informs the decoder of the code $C_c(V)$ which is used. Since this code $C_c(V)$ is designed for the BSC($D * p$), the decoder can with high probability determine $X$ given $Y'$. By construction, $X$ represents $Y$ with distortion roughly $D$ as desired.

Theorem 10 (Optimality for the Wyner-Ziv Problem): Let $Y$ be a BSS and $Y'$ be a Bernoulli random variable correlated to $Y$ as $Y' = Y \oplus Z$, where $Z \sim$ Ber($p$). Fix the design distortion $D$, $0 < D < \frac12$. For any rate $R > h_2(D * p) - h_2(D)$ and any $0 < \beta < \frac12$, there exists a sequence of nested polar codes of length $N$ with rates $R_N \le R$ so that under SC encoding using randomized rounding at the encoder and SC decoding at the decoder, they achieve expected distortion $D_N$ satisfying

$$D_N \le D + O(2^{-(N^\beta)}),$$

and the block error probability satisfying

$$P_N^B \le O(2^{-(N^\beta)}).$$

The encoding as well as decoding complexity of these codes is $\Theta(N \log(N))$.

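Proof: Let $\epsilon > 0$ and $0 < \beta < \frac12$ be some constants. Let $Z^{(i)}(q)$ denote the $Z^{(i)}$s computed with $W$ set to BSC($q$). Let $\delta_N = \frac{1}{N} 2^{-(N^\beta)}$. Let $F_s$ and $F_c$ denote the sets

$$F_s = \{i : Z^{(i)}(D) \ge 1 - \delta_N^2\},$$
$$F_c = \{i : Z^{(i)}(D * p) \ge \delta_N\}.$$

Theorem 16 implies that for $N$ sufficiently large

$$\frac{|F_s|}{N} \ge h_2(D) - \frac{\epsilon}{2}.$$

Similarly, Theorem 15 implies that for $N$ sufficiently large

$$\frac{|F_c|}{N} \le h_2(D * p) + \frac{\epsilon}{2}.$$

The degradation of BSC($D * p$) with respect to BSC($D$) implies that $F_s \subseteq F_c$.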

The bits in $F_s$ are fixed to 0. This is known both to the encoder and the decoder. A source vector $y$ is mapped to $u_{F_s^c} = f^{0}(y)$ as shown in the Standard Model. Therefore the average distortion $D_N$ is bounded as

$$D_N \le D + 2|F_s|\delta_N \le D + O(2^{-(N^\beta)}).$$

The encoder transmits the vector $u_{F_c \setminus F_s}$ to the decoder. The required rate is

$$R_N = \frac{|F_c| - |F_s|}{N} \le h_2(D * p) - h_2(D) + \epsilon.$$

It remains to show that at the decoder the block error probability incurred in decoding $X$ is $O(2^{-(N^\beta)})$.

Let $E$ denote the quantization error, $E = Y \oplus X$. The information available at the decoder ($Y'$) can be expressed as

$$Y' = X \oplus E \oplus Z.$$

Consider the code $C_c(v)$ for a given $v$ and transmission over the BSC($D * p$). Let $\mathcal{E} \subseteq \{0,1\}^N$ denote the set of noise vectors of the channel which result in a decoding error under SC decoding. By the equivalent of Lemma 8 for the channel coding case, this set does not depend on $v$.

The block error probability of our scheme can then be expressed as

$$P_N^B = \mathbb{E}[\mathbb{1}_{\{E \oplus Z \in \mathcal{E}\}}].$$

The exact distribution of the quantization error is not known, but Lemma 9 provides a bound on the total variation distance between this distribution and an i.i.d. Ber($D$) distribution. Let $B$ denote an i.i.d. Ber($D$) vector. Let $P_E$ and $P_B$ denote the
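distribution of $E$ and $B$ respectively. Then

$$\sum_{e} |P_E(e) - P_B(e)| \le 2|F_s|\delta_N \le O(2^{-(N^\beta)}). \quad (13)$$

Let $\Pr(B, E)$ denote the so-called optimal coupling between $E$ and $B$, i.e., a joint distribution of $E$ and $B$ with marginals equal to $P_E$ and $P_B$, and satisfying

$$\Pr(E \ne B) = \frac12 \sum_{e} |P_E(e) - P_B(e)|. \quad (14)$$

It is known [33] that such a coupling exists. Let $E$ and $B$ be generated according to $\Pr(\cdot, \cdot)$. Then, the block error probability can be expanded as

\begin{align*}
P_N^B &= \mathbb{E}[\mathbb{1}_{\{E \oplus Z \in \mathcal{E}\}} \mathbb{1}_{\{E = B\}}] + \mathbb{E}[\mathbb{1}_{\{E \oplus Z \in \mathcal{E}\}} \mathbb{1}_{\{E \ne B\}}] \\
&\le \mathbb{E}[\mathbb{1}_{\{B \oplus Z \in \mathcal{E}\}}] + \mathbb{E}[\mathbb{1}_{\{E \ne B\}}].
\end{align*}

The first term in the sum refers to the block error probability for the BSC($D * p$), which can be bounded as

$$\mathbb{E}[\mathbb{1}_{\{B \oplus Z \in \mathcal{E}\}}] \le \sum_{i \in F_c^c} Z^{(i)}(D * p) \le O(2^{-(N^\beta)}). \quad (15)$$

Using (13), (14) and (15) we get

$$P_N^B \le O(2^{-(N^\beta)}).$$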



B. Binary Gelfand-Pinsker Problem 

Let $S$ denote a symmetric Bernoulli random variable. Consider a channel with state $S$ given by

$$Y = X \oplus S \oplus Z,$$

where $Z$ is a Ber($p$) random variable. The state $S$ is known to the encoder non-causally and not known to the decoder. The output of the encoder is constrained to satisfy $\mathbb{E}[X] \le D$, i.e., on average the fraction of 1s it can transmit is bounded by $D$. This is similar to the power constraint in the continuous case. The task of the encoder is to transmit a message $M$ to the decoder with vanishing error probability under the above mentioned input constraint.

Fig. 5. The state $S$ is known to the encoder in advance. The weight of the input $X$ is constrained to $\mathbb{E}[X] \le D$.

In [34], it was shown that the achievable rate, weight pairs for this channel are given by

$$\text{u.c.e.}\{(R_{\mathrm{gp}}(D), D), (0, 0)\},$$

where $R_{\mathrm{gp}}(D) = h_2(D) - h_2(p)$, and u.c.e. denotes the upper convex envelope.

Similar to the Wyner-Ziv problem, we need a nested code for this problem. However, they differ in the sense that the roles of the channel and source codes are reversed.

Let $C_c$ denote the polar code defined by the frozen set $F_c$ with frozen bits $u_{F_c}$ set to 0. Let $C_s(v)$ denote the code defined by the frozen set $F_s \supseteq F_c$, with the frozen bits $u_{F_c}$ set to 0 and $u_{F_s \setminus F_c} = v$. The code $C_c$ is designed to be a good channel code for the BSC($p$) and the codes $C_s(v)$ are designed to be good source codes for distortion $D$. This implies that the code $C_c$ can be partitioned into $C_s(v)$ for $v \in \{0,1\}^{|F_s \setminus F_c|}$, i.e.,

$$C_c = \bigcup_v C_s(v).$$

The frozen bits $V = U_{F_s \setminus F_c}$ are determined by the message $M$ that is transmitted. The encoder compresses the state vector $S$ to a vector $U_{F_s^c}$ through the map $U_{F_s^c} = f^{u_{F_s}}(S)$. Let $S'$ be the reconstruction vector $S' = \hat f^{u_{F_s}}(f^{u_{F_s}}(S))$. The encoder sends the vector $X = S \oplus S'$ through the channel. Since the codes $C_s(V)$ are good source codes, the expected distortion $\frac{1}{N}\mathbb{E}[d(S, S')]$ (hence the average weight of $X$) is close to $D$ (see Lemma 8). Since the code $C_c$ is designed for the BSC($p$), the decoder will succeed in decoding the codeword $S \oplus X = S'$ (hence the message $V$) with high probability.

Here we focus on achieving the rates of the form $R_{\mathrm{gp}}(D)$. The remaining rates can be achieved by appropriate time-sharing with the pair $(0, 0)$.

Theorem 11 (Optimality for the Gelfand-Pinsker Problem): Let $S$ be a symmetric Bernoulli random variable. Fix $D$, $0 < D < \frac12$. For any rate $R < h_2(D) - h_2(p)$ and any $0 < \beta < \frac12$, there exists a sequence of polar codes of length $N$ so that under SC encoding using randomized rounding at the encoder and SC decoding at the decoder, the achievable rate satisfies

$$R_N \ge R,$$

with the expected weight of $X$, $D_N$, satisfying

$$D_N \le D + O(2^{-(N^\beta)}),$$

and the block error probability satisfying

$$P_N^B \le O(2^{-(N^\beta)}).$$

The encoding as well as decoding complexity of these codes is $\Theta(N \log(N))$.

Proof: Let $\epsilon > 0$ and $0 < \beta < \frac12$ be some constants. Let $Z^{(i)}(q)$ denote the $Z^{(i)}$s computed with $W$ set to BSC($q$). Let $\delta_N = \frac{1}{N} 2^{-(N^\beta)}$. Let $F_s$ and $F_c$ denote the sets

$$F_s = \{i : Z^{(i)}(D) \ge 1 - \delta_N^2\}, \quad (16)$$
$$F_c = \{i : Z^{(i)}(p) \ge \delta_N\}. \quad (17)$$

Theorem 16 implies that for $N$ sufficiently large

$$\frac{|F_s|}{N} \ge h_2(D) - \frac{\epsilon}{2}.$$

Similarly, Theorem 15 implies that for $N$ sufficiently large

$$\frac{|F_c|}{N} \le h_2(p) + \frac{\epsilon}{2}.$$

The degradation of BSC($D$) with respect to BSC($p$) implies that $F_c \subseteq F_s$. The vector $u_{F_s \setminus F_c}$ is defined by the message that is transmitted. Therefore, the rate of transmission is

$$\frac{|F_s| - |F_c|}{N} \ge h_2(D) - h_2(p) - \epsilon.$$

The vector $S$ is compressed using the source code with frozen set $F_s$. The frozen vector $u_{F_s}$ is defined in two stages. The subvector $u_{F_c}$ is fixed to 0 and is known to both the transmitter and the receiver. The subvector $u_{F_s \setminus F_c}$ is defined by the message being transmitted.

Let $S$ be mapped to a reconstruction vector $S'$. Lemma 8 implies that the average distortion of the Standard Model is independent of the value of the frozen bits. This implies

$$\frac{1}{N}\mathbb{E}[d(S, S')] \le D + 2|F_s|\delta_N \le D + O(2^{-(N^\beta)}).$$

Therefore, a transmitter which sends $X = S \oplus S'$ will on average be using $D + O(2^{-(N^\beta)})$ fraction of 1s. The received vector is given by

$$Y = X \oplus S \oplus Z = S' \oplus Z.$$

The vector $S'$ is a codeword of $C_c$, the code designed for the BSC($p$) (see (17)). Therefore, the block error probability of the SC decoder in decoding $S'$ (and hence $V$) is bounded as

$$P_N^B \le \sum_{i \in F_c^c} Z^{(i)}(p) \le O(2^{-(N^\beta)}).$$

C. Storage in Memory With Defects 

Let us briefly discuss another standard problem in the literature that fits within the Gelfand-Pinsker framework but where the state is non-binary. Consider the problem of storing data on a computer memory with defects and noise, explored in [35] and [36]. Each memory cell can be in three possible states, say $\{0, 1, *\}$. The state $S = 0$ ($1$) means that the value of the cell is stuck at 0 (1) and $S = *$ means that the value of the cell is flipped with probability $D$. Let the probability distribution of $S$ be

$$\Pr(S = 0) = \Pr(S = 1) = p/2, \quad \Pr(S = *) = 1 - p.$$



The optimal storage capacity when the whole state realization is known in advance only to the encoder is $(1 - p)(1 - h_2(D))$.

Theorem 12 (Optimality for the Storage Problem): For any rate $R < (1 - p)(1 - h_2(D))$ and any $0 < \beta < \frac12$, there exists a sequence of polar codes of length $N$ so that under SC encoding using randomized rounding at the encoder and SC decoding at the decoder, the achievable rate satisfies

$$R_N \ge R,$$

and the block error probability satisfying

$$P_N^B \le O(2^{-(N^\beta)}).$$

The encoding as well as decoding complexity of these codes is $\Theta(N \log(N))$.

The problem can be framed as a Gelfand-Pinsker setup with state $S \in \{0, 1, *\}$. As seen before, the nested construction for such a problem consists of a good source code which partitions into cosets of a good channel code. We still need to define what the corresponding source and channel coding problems are.

Source Code: The source code is designed to compress the ternary source $S$ to the binary alphabet $\{0, 1\}$ with design distortion $D$. The distortion function is $d(0, 1) = 1$, $d(*, 1) = d(*, 0) = 0$. The test channel for this problem is a binary symmetric erasure channel (BSEC) shown in Figure 7. The compression of this source is explained in Section VIII. Let $Z^{(i)}(p, D)$ denote the Bhattacharyya values of BSEC($p, D$) defined in Figure 7. The frozen set $F_s$ is defined as

$$F_s = \{i : Z^{(i)}(p, D) \ge 1 - \delta_N^2\}.$$

The rate-distortion function for this problem is given by $p(1 - h_2(D))$. Therefore, for sufficiently large $N$, $|F_s|/N$ can be made arbitrarily close to $1 - p(1 - h_2(D))$.

Channel Code: The channel code is designed for BSC($D$). The frozen set $F_c$ is defined as

$$F_c = \{i : Z^{(i)}(D) \ge \delta_N\}.$$

Therefore, for sufficiently large $N$, $|F_c|/N$ can be made arbitrarily close to $h_2(D)$. Degradation of BSEC($p, D$) with respect to BSC($D$) implies $F_c \subseteq F_s$.

Encoding: The frozen bits $u_{F_c}$ are fixed to 0. The vector $u_{F_s \setminus F_c}$ is defined by the message to be stored. Therefore, the achievable rate is

$$R_N = \frac{|F_s| - |F_c|}{N} \ge (1 - p)(1 - h_2(D)) - \epsilon$$

for any $\epsilon > 0$. Compress the source sequence using the function $f^{u_{F_s}}(S)$ and store the reconstruction vector $X = \hat f^{u_{F_s}}(f^{u_{F_s}}(S))$ in the memory. As shown in the Wyner-Ziv setting, the quantization noise is close to Ber($D$) for the stuck bits. Therefore, a fraction $D$ of the stuck bits differ from $X$.
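The achievable rate above is the difference of the two limiting frozen-set fractions, $1 - p(1 - h_2(D))$ for the source code and $h_2(D)$ for the channel code. The short sketch below checks numerically that this difference equals $(1 - p)(1 - h_2(D))$; the function names and the sample values of $p$ and $D$ are illustrative assumptions.

```python
from math import log2

def h2(x):
    """Binary entropy function in bits."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)

def storage_rate(p, D):
    """Difference of the limiting fractions |F_s|/N and |F_c|/N."""
    frozen_source = 1 - p * (1 - h2(D))   # limiting value of |F_s|/N
    frozen_channel = h2(D)                # limiting value of |F_c|/N
    return frozen_source - frozen_channel

p, D = 0.3, 0.11
print(storage_rate(p, D), (1 - p) * (1 - h2(D)))
```

Both printed numbers agree up to floating-point rounding, since $1 - p(1 - h_2(D)) - h_2(D) = (1 - p)(1 - h_2(D))$.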

Decoding: When the decoder reads the memory, the stuck bits are read as is and the remaining bits are flipped with probability $D$. This is equivalent to seeing $X$ through a channel BSC($D$). Since the channel code is defined for BSC($D$), the decoding will be successful with high probability and the message $u_{F_s \setminus F_c}$ will be recovered.

D. One Helper Problem

Let $Y$ be a BSS and let $Y'$ be correlated to $Y$ as $Y' = Y \oplus Z$, where $Z$ is a Ber($p$) random variable. The encoder has access to $Y$ and the helper has access to $Y'$. The aim of the decoder is to reconstruct $Y$ successfully. As the name suggests, the role of the helper is to assist the decoder in recovering $Y$. This problem was considered by Wyner in [32].

Fig. 6. The helper transmits a quantized version of $Y'$. The decoder uses the information from the helper to decode $Y$ reliably.

Let the rates used by the encoder and the helper be $R$ and $R'$ respectively. Wyner [32] showed that the required rates $R, R'$ must satisfy

$$R \ge h_2(D * p), \quad R' \ge 1 - h_2(D),$$

for some $D \in [0, 1/2]$.

Theorem 13 (Optimality for the One Helper Problem): Let $Y$ be a BSS and $Y'$ be a Bernoulli random variable correlated to $Y$ as $Y' = Y \oplus Z$, where $Z \sim$ Ber($p$). Fix the design distortion $D$, $0 < D < \frac12$. For any rate pair of the form $R > h_2(D * p)$, $R' > 1 - h_2(D)$ and any $0 < \beta < \frac12$, there exist sequences of polar codes of length $N$ with rates $R_N \le R$ and $R'_N \le R'$ so that under syndrome computation at the encoder, SC encoding using randomized rounding at the helper and SC decoding at the decoder, they achieve the block error probability satisfying

$$P_N^B \le O(2^{-(N^\beta)}).$$

The encoding as well as decoding complexity of these codes is $\Theta(N \log(N))$.

For this problem, we require a good channel code at the encoder and a good source code at the helper. We will explain the code construction here. The rest of the proof is similar to the previous setups.

Encoding: The helper quantizes the vector $Y'$ to $X'$ with a design distortion $D$. This compression can be achieved with rates arbitrarily close to $1 - h_2(D)$.

The encoder designs a code for the BSC($D * p$). Let $F$ denote the frozen set. The encoder computes the syndrome $u_F = (YH_n^{-1})_F$ and transmits it to the decoder. The rate involved in such an operation is $R = |F|/N$. Since the fraction $|F|/N$ can be made arbitrarily close to $h_2(D * p)$, the rate $R$ will approach $h_2(D * p)$.

Decoding: The decoder first reconstructs the vector $X'$. The remaining task is to decode the codeword $Y$ from the observation $X'$. As shown in the Wyner-Ziv setting, the quantization noise $Y \oplus X'$ is very "close" to Ber($D * p$). Note that the decoder knows the syndrome $u_F = (YH_n^{-1})_F$, where the frozen set $F$ is designed for the BSC($D * p$). Therefore, the task of the decoder is to recover the codeword of a code designed for BSC($D * p$) when the noise is close to Ber($D * p$). Hence the decoder will succeed with high probability.

VII. Complexity Versus Gap

We have seen that polar codes under SC encoding achieve the rate-distortion bound when the blocklength $N$ tends to infinity. It is also well-known that the encoding as well as decoding complexity grows like $\Theta(N \log(N))$. How does the complexity grow as a function of the gap to the rate-distortion bound? This is a much more subtle question.

To see what is involved in being able to answer this question, consider the Bhattacharyya constants $Z^{(i)}$ defined earlier. Let $Z_{(i)}$ denote a re-ordering of these values in an increasing order, i.e., $Z_{(i)} \le Z_{(i+1)}$, $i = 0, \dots, N-2$. Define

$$m_N^{(i)} = \sum_{j=0}^{i-1} Z_{(j)}, \qquad M_N^{(i)} = \sum_{j=N-i}^{N-1} \sqrt{2(1 - Z_{(j)})}.$$



For the binary erasure channel there is a simple recursion to compute the $\{Z^{(i)}\}$ as shown in [15]. For general channels the computation of these constants is more involved but the basic principle is the same.

For the channel coding problem we then get an upper bound on the block error probability $P_N^B$ as a function of the rate $R$ of the form

$$(P_N^B, R) = \Big(m_N^{(i)}, \frac{i}{N}\Big).$$

On the other hand, for the source coding problem, we get an upper bound on the distortion $D_N$ as a function of the rate of the form

$$(D_N, R) = \Big(D + M_N^{(i)}, \frac{N - i}{N}\Big).$$

Now, if we knew the distribution of the $Z^{(i)}$s it would allow us to determine the rate-distortion performance achievable for this coding scheme for any given length. The complexity per bit is always $\Theta(\log N)$.
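For the binary erasure channel the quantities above can be tabulated directly. The sketch below uses the known BEC recursion $Z \mapsto (2Z - Z^2, Z^2)$ from [15] and assumes, for illustration, that $m_N^{(i)}$ is the sum of the $i$ smallest Bhattacharyya values and $M_N^{(i)}$ is the sum of $\sqrt{2(1 - Z)}$ over the $i$ largest ones. The function names and the choice of erasure probability 0.5 and $n = 10$ are hypothetical, and the BEC serves only as the one case where the recursion is this simple.

```python
from math import sqrt

def bec_bhattacharyya(eps, n):
    """Bhattacharyya values Z^{(i)}, i = 0..2^n - 1, for a BEC with erasure probability eps."""
    z = [eps]
    for _ in range(n):
        nxt = []
        for v in z:
            nxt.append(2 * v - v * v)  # the degraded ("minus") channel
            nxt.append(v * v)          # the upgraded ("plus") channel
        z = nxt
    return z

def bound_tables(eps, n):
    """Return lists m and M with m[i] = m_N^{(i)} and M[i] = M_N^{(i)} as assumed above."""
    z_sorted = sorted(bec_bhattacharyya(eps, n))
    m = [0.0]
    for v in z_sorted:                 # i smallest values
        m.append(m[-1] + v)
    M = [0.0]
    for v in reversed(z_sorted):       # i largest values
        M.append(M[-1] + sqrt(2 * max(0.0, 1 - v)))
    return m, M

m, M = bound_tables(eps=0.5, n=10)
N = len(m) - 1
i = N // 4
print("channel coding: rate", i / N, "error bound", m[i])
print("source coding: rate", (N - i) / N, "excess distortion bound", M[i])
```

Sweeping `i` from 0 to `N` traces out the two trade-off curves for this blocklength.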

Unfortunately, the computation of the quantities $m_N^{(i)}$ and $M_N^{(i)}$ is likely to be a challenging problem. Therefore, we ask a simpler question that we can answer with the estimates we currently have about the $\{Z^{(i)}\}$.

Let $R = R(D) + \delta$, where $\delta > 0$. How does the complexity per bit scale with respect to the gap between the actual (expected) distortion $D_N$ and the design distortion $D$? Let us answer this question for the various low-complexity schemes that have been proposed to date.

Trellis Codes: In [5] it was shown that, using trellis codes and Viterbi decoding, the average distortion scales like $D + O(2^{-K E(R)})$, where $E(R) > 0$ for $\delta > 0$ and $K$ is the constraint length. The complexity of the decoding algorithm is $\Theta(2^K N)$. Therefore, the complexity per bit in terms of the gap $g$ is given by $O(2^{\log\frac{1}{g}})$.

Low Density Codes: In [37] it was shown that under optimum encoding the gap is $O(\sqrt{K}\, 2^{-K\Delta})$, for some $\Delta > 0$, where $K$ is the average degree of the parity check node. Assuming that using BID we can achieve this distortion, the complexity is given by $\Theta(2^K N)$. Therefore, the complexity per bit in terms of the gap $g$ is given by $O(2^{\log\frac{1}{g}})$.

Fig. 7. The test channel for the binary erasure source.

Polar Codes: For polar codes, the complexity is $\Theta(N \log N)$ and the gap is $O(2^{-(N^\beta)})$ for any $\beta < \frac12$. Therefore, the complexity per bit in terms of the gap $g$ is $O(\frac{1}{\beta} \log\log\frac{1}{g})$. This is considerably lower than for the two previous schemes.
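The scaling for polar codes follows by solving $g = 2^{-(N^\beta)}$ for $\log N$, which gives $\log N = \frac{1}{\beta}\log\log\frac{1}{g}$. The sketch below compares this with the trellis scaling, under the simplifying assumption that all constants hidden in the $O(\cdot)$ notation equal one; the function names and the parameter `exponent` (standing in for $E(R)$) are illustrative.

```python
from math import log2

def polar_cost_per_bit(gap, beta):
    """log2(N) for the N that solves gap = 2^{-(N^beta)}."""
    return log2(log2(1.0 / gap)) / beta

def trellis_cost_per_bit(gap, exponent):
    """2^K for the K that solves gap = 2^{-K * exponent}."""
    return 2.0 ** (log2(1.0 / gap) / exponent)

for gap in (1e-2, 1e-4, 1e-8):
    print(gap, polar_cost_per_bit(gap, beta=0.4), trellis_cost_per_bit(gap, exponent=1.0))
```

The first column grows doubly logarithmically in $1/g$ while the second grows polynomially in $1/g$, which is the qualitative difference described above.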

VIII. Discussion and Future Work 

We have considered the lossy source coding problem for the BSS and the Hamming distortion. The reconstruction alphabet in this case is also binary and the test channel $W$ is a BSC.

Consider the slightly more general scenario of a $q$-ary source with a binary reconstruction alphabet. Assume further that the test channel, call it $W$, is such that the marginal induced by the source distribution on the reconstruction alphabet is uniform.

Example 14 (Binary Erasure Source): Let the source alphabet be $\{0, 1, *\}$. Let $S$ denote the source variable with distribution

$$\Pr(S = 1) = \Pr(S = 0) = p/2, \quad \Pr(S = *) = 1 - p.$$

Let the distortion function be

$$d(0, *) = d(1, *) = 0, \quad d(0, 1) = 1. \quad (18)$$

For a design distortion $D$, the test channel $W : \{0, 1\} \to \{0, 1, *\}$ is shown in Figure 7. Note that the distribution induced on the input of the channel is uniform.

For this setup one can obtain results mirroring Theorem 1. More precisely, one can show that the optimum rate-distortion tradeoff can again be achieved by polar codes together with SC encoding and randomized rounding. The proof is analogous to the proof of Theorem 1. The only change in the proof consists of replacing the BSC($D$) with the appropriate test channel $W$. This is the source coding equivalent of Arikan's channel coding result [15], where it was shown that polar codes achieve the symmetric mutual information $I(W)$ for any B-DMC.

A further important generalization is the compression of non-symmetric sources. Let us explain the involved issues by means of the channel coding problem. Consider an asymmetric B-DMC, e.g., the Z-channel. Due to the asymmetry, the capacity-achieving input distribution is in general not the uniform one. To be concrete, assume that it is $(p(0) = \frac13, p(1) = \frac23)$. This causes problems for any scheme which employs linear codes, since linear codes induce uniform marginals. To get around this problem, "augment" the channel to a $q$-ary input channel by duplicating some of the inputs. For our running example, Figure 8 shows the ternary channel which results when duplicating the input "1." Note that the capacity-achieving input distribution for this ternary-input channel is the uniform one. Assume that we can construct a ternary polar code which achieves the symmetric mutual information of this new channel. (For binary-input channels it was shown by Arikan [15] that one can achieve the symmetric mutual information and there is good reason to believe that an equivalent result holds for $q$-ary input channels.) Then this gives rise to a capacity-achieving coding scheme for the original binary Z-channel by mapping the ternary set $\{0, 1, 2\}$ into the binary set $\{0, 1\}$ in the following way: $\{1, 2\} \mapsto 1$ and $0 \mapsto 0$.

Fig. 8. The Z-channel and its corresponding augmented channel with ternary input alphabet.
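The effect of the augmentation can be seen with a few lines of exact arithmetic: a uniform distribution on the ternary alphabet, pushed through the map $\{1, 2\} \mapsto 1$, $0 \mapsto 0$, induces a non-uniform binary distribution. The helper name `induced_distribution` is illustrative.

```python
from fractions import Fraction

def induced_distribution(mapping, q):
    """Distribution on the image alphabet induced by a uniform input on {0, ..., q-1}."""
    dist = {}
    for symbol in range(q):
        image = mapping[symbol]
        dist[image] = dist.get(image, Fraction(0)) + Fraction(1, q)
    return dist

# Ternary-to-binary map for the augmented Z-channel: input "1" is duplicated.
mapping = {0: 0, 1: 1, 2: 1}
print(induced_distribution(mapping, q=3))
```

The induced distribution puts probability $1/3$ on input 0 and $2/3$ on input 1, matching the running example.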

More generally, by augmenting the input alphabet and constructing a code for the extended alphabet, we can achieve rates arbitrarily close to the capacity of a $q$-ary DMC, assuming only that we know how to achieve the symmetric mutual information.

A similar remark applies to the setting of source coding. By extending the reconstruction alphabet if necessary and by using only test channels that induce a uniform distribution on this extended alphabet one can achieve a rate-distortion performance arbitrarily close to the Shannon bound, assuming only that for the uniform case we can get arbitrarily close.

The previous discussion shows that perhaps the most important generalization is the construction of polar codes for both source and channel coding for the setting of $q$-ary alphabets.

In Section VI we have considered some scenarios beyond basic source coding. E.g., we considered binary versions of the Wyner-Ziv problem as well as the Gelfand-Pinsker problem. This list is by no means exhaustive.

One possible further generalization is to have source codes with a faster convergence speed. In [38] it was shown that, by considering larger matrices (instead of $G_2$), it is possible to obtain better exponents for the block error probability of the channel coding problem. Such a generalization for source coding would result in better exponents in the convergence of the average distortion to the design distortion.

Appendix

The proof of (4) and (5) is based on the following approach. For any channel $W : \mathcal{X} \to \mathcal{Y}$ the channels $W^{[i]} : \mathcal{X} \to \mathcal{Y} \times \mathcal{Y} \times \mathcal{X}^{i}$, $i \in \{0, 1\}$, are defined as follows. Let $W^{[0]}$ denote the channel law

$$W^{[0]}(y_0, y_1 \mid u_0) = \frac12 \sum_{u_1} W(y_0 \mid u_0 \oplus u_1) W(y_1 \mid u_1),$$

and let $W^{[1]}$ denote the channel law

$$W^{[1]}(y_0, y_1, u_0 \mid u_1) = \frac12 W(y_0 \mid u_0 \oplus u_1) W(y_1 \mid u_1).$$

Define a random variable $W_n$ through a tree process $\{W_n; n \ge 0\}$ with

$$W_0 = W, \qquad W_{n+1} = W_n^{[B_{n+1}]},$$

where $\{B_n; n \ge 1\}$ is a sequence of i.i.d. random variables defined on a probability space $(\Omega, \mathcal{F}, \mu)$, and where $B_n$ is a symmetric Bernoulli random variable. Defining $\mathcal{F}_0 = \{\emptyset, \Omega\}$ and $\mathcal{F}_n = \sigma(B_1, \dots, B_n)$ for $n \ge 1$, we augment the above process by the process $\{Z_n; n \ge 0\} := \{Z(W_n); n \ge 0\}$. The relevance of this process is that $W_n \in \{W_N^{(i)}\}_{i=0}^{2^n - 1}$ and moreover the symmetric distribution of the random variables $B_i$ implies

$$\Pr(Z_n \in (a, b)) = \frac{|\{i \in \{0, \dots, 2^n - 1\} : Z^{(i)} \in (a, b)\}|}{2^n}. \quad (19)$$

In [15] it was shown that

$$\lim_{n \to \infty} \Pr(Z_n < 2^{-5n/4}) = I(W),$$

which implies (4). In [16] the polynomial decay (in terms of $N = 2^n$) was improved to exponential decay as stated below.

Theorem 15 (Rate of $Z_n$ Approaching 0 [16]): Given a B-DMC $W$, and any $\beta < \frac12$,

$$\lim_{n \to \infty} \Pr(Z_n \le 2^{-2^{n\beta}}) = I(W).$$

Of course, this implies (5). For lossy source compression, the important quantity is the rate at which the random variable $Z_n$ approaches 1 (as compared to 0). Let us now show the result mirroring Theorem 15 for this case, using similar techniques as in [16].

Theorem 16 (Rate of $Z_n$ Approaching 1): Given a B-DMC $W$, and any $\beta < \frac12$,

$$\lim_{n \to \infty} \Pr(Z_n \ge 1 - 2^{-2^{n\beta}}) = 1 - I(W).$$

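Proof: Using Lemma 17 the random variable $Z_{n+1}$ can be bounded as

$$Z_{n+1} \ge \sqrt{2Z_n^2 - Z_n^4} \quad \text{w.p. } \tfrac12,$$
$$Z_{n+1} = Z_n^2 \quad \text{w.p. } \tfrac12.$$

Then, with probability $\frac12$, $Z_{n+1}^2 \ge 1 - (1 - Z_n^2)^2$. This implies that $1 - Z_{n+1}^2 \le (1 - Z_n^2)^2$. Similarly, with probability $\frac12$,

$$1 - Z_{n+1}^2 = 1 - Z_n^4 \le 2(1 - Z_n^2).$$

Let $X_n$ denote $X_n = 1 - Z_n^2$. Then $\{X_n : n \ge 0\}$ satisfies

$$X_{n+1} \le X_n^2 \quad \text{w.p. } \tfrac12,$$
$$X_{n+1} \le 2X_n \quad \text{w.p. } \tfrac12.$$

By adapting the proof of [16], we can show that for any $\beta < \frac12$,

$$\lim_{n \to \infty} \Pr(X_n \le 2^{-2^{n\beta}}) = 1 - I(W).$$

Using the relation $X_n = 1 - Z_n^2 \ge 1 - Z_n$, we get

$$\lim_{n \to \infty} \Pr(1 - Z_n \le 2^{-2^{n\beta}}) = 1 - I(W).$$

■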
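The two algebraic steps used in the proof, $1 - (2z^2 - z^4) = (1 - z^2)^2$ and $1 - z^4 \le 2(1 - z^2)$ for $z \in [0, 1]$, can be sanity-checked on a grid. This is only a numerical sketch; the function name `check_recursion_bounds` and the tolerance are arbitrary choices.

```python
def check_recursion_bounds(num_points=1000, tol=1e-12):
    """Check both branch inequalities for X_n = 1 - Z_n^2 on a grid of z in [0, 1]."""
    for k in range(num_points + 1):
        z = k / num_points
        x = 1 - z * z                       # X_n
        lower_sq = 2 * z * z - z ** 4       # lower bound on Z_{n+1}^2 in the first branch
        # First branch: 1 - Z_{n+1}^2 <= 1 - lower_sq, which should equal X_n^2.
        if abs((1 - lower_sq) - x * x) > tol:
            return False
        # Second branch: Z_{n+1} = Z_n^2, so 1 - Z_{n+1}^2 = 1 - z^4 <= 2 X_n.
        if 1 - z ** 4 > 2 * x + tol:
            return False
    return True

print(check_recursion_bounds())
```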

Lemma 17 (Lower Bound on Z): Let $W_1$ and $W_2$ be two B-DMCs and let $X_1$ and $X_2$ be their inputs with a uniform prior. Let $Y_1 \in \mathcal{Y}_1$ and $Y_2 \in \mathcal{Y}_2$ denote the outputs. Let $W$ denote the channel between $X = X_1 \oplus X_2$ and the output $(Y_1, Y_2)$, i.e.,

$$W(y_1, y_2 \mid x) = \frac12 \sum_{u} W_1(y_1 \mid x \oplus u) W_2(y_2 \mid u).$$

Then

$$Z(W) \ge \sqrt{Z(W_1)^2 + Z(W_2)^2 - Z(W_1)^2 Z(W_2)^2}.$$
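The bound can be checked numerically for small channels. The sketch below builds the combined channel from its definition and compares its Bhattacharyya parameter with the right-hand side of the lemma, for pairs of binary symmetric channels and pairs of binary erasure channels. Channels are represented as nested lists `W[x][y]`; all function names are illustrative.

```python
from itertools import product
from math import sqrt

def bsc(p):
    """Transition probabilities W[x][y] of a BSC with crossover probability p."""
    return [[1 - p, p], [p, 1 - p]]

def bec(e):
    """Transition probabilities W[x][y] of a BEC; outputs are ordered (0, erasure, 1)."""
    return [[1 - e, e, 0.0], [0.0, e, 1 - e]]

def bhattacharyya(W):
    """Z(W) = sum over y of sqrt(W(y|0) W(y|1))."""
    return sum(sqrt(a * b) for a, b in zip(W[0], W[1]))

def combine(W1, W2):
    """W(y1, y2 | x) = 1/2 * sum_u W1(y1 | x xor u) W2(y2 | u)."""
    outputs = list(product(range(len(W1[0])), range(len(W2[0]))))
    return [
        [0.5 * sum(W1[x ^ u][y1] * W2[u][y2] for u in (0, 1)) for (y1, y2) in outputs]
        for x in (0, 1)
    ]

def lemma17_bound(z1, z2):
    """Right-hand side of the lemma."""
    return sqrt(z1 ** 2 + z2 ** 2 - (z1 * z2) ** 2)

for W1, W2 in [(bsc(0.1), bsc(0.3)), (bec(0.3), bec(0.5))]:
    z = bhattacharyya(combine(W1, W2))
    print(z, lemma17_bound(bhattacharyya(W1), bhattacharyya(W2)))
```

For two BSCs the two numbers coincide up to rounding, so the bound is tight in that case; for two BECs the combined channel has $Z = Z_1 + Z_2 - Z_1 Z_2$, which is strictly larger than the bound.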

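Proof: Let $Z = Z(W)$ and $Z_i = Z(W_i)$. $Z$ can be expanded as follows.

\begin{align*}
Z &= \sum_{y_1, y_2} \sqrt{W(y_1, y_2 \mid 0) W(y_1, y_2 \mid 1)} \\
&= \frac12 \sum_{y_1, y_2} \Big[ W_1(y_1 \mid 0) W_2(y_2 \mid 0) W_1(y_1 \mid 0) W_2(y_2 \mid 1) \\
&\qquad\quad + W_1(y_1 \mid 0) W_2(y_2 \mid 0) W_1(y_1 \mid 1) W_2(y_2 \mid 0) \\
&\qquad\quad + W_1(y_1 \mid 1) W_2(y_2 \mid 1) W_1(y_1 \mid 0) W_2(y_2 \mid 1) \\
&\qquad\quad + W_1(y_1 \mid 1) W_2(y_2 \mid 1) W_1(y_1 \mid 1) W_2(y_2 \mid 0) \Big]^{1/2} \\
&= \frac{Z_1 Z_2}{2} \sum_{y_1, y_2} P_1(y_1) P_2(y_2) \sqrt{\frac{W_1(y_1 \mid 0)}{W_1(y_1 \mid 1)} + \frac{W_1(y_1 \mid 1)}{W_1(y_1 \mid 0)} + \frac{W_2(y_2 \mid 0)}{W_2(y_2 \mid 1)} + \frac{W_2(y_2 \mid 1)}{W_2(y_2 \mid 0)}},
\end{align*}

where $P_i(y_i)$ denotes

$$P_i(y_i) = \frac{\sqrt{W_i(y_i \mid 0) W_i(y_i \mid 1)}}{Z_i}.$$

Note that $P_i$ is a probability distribution over $\mathcal{Y}_i$. Let $\mathbb{E}_i$ denote the expectation with respect to $P_i$ and let

$$A_i(y) = \sqrt{\frac{W_i(y \mid 0)}{W_i(y \mid 1)}} + \sqrt{\frac{W_i(y \mid 1)}{W_i(y \mid 0)}}.$$

Then $Z$ can be expressed as

$$Z = \frac{Z_1 Z_2}{2} \mathbb{E}_{1,2}\Big[\sqrt{(A_1(Y_1))^2 + (A_2(Y_2))^2 - 4}\Big].$$

The arithmetic-mean geometric-mean inequality implies that $A_i(y) \ge 2$. Therefore, for any $y_i \in \mathcal{Y}_i$, $A_i(y_i)^2 - 4 \ge 0$. Note that the function $f(x) = \sqrt{x^2 + a}$ is convex for $a \ge 0$. Applying Jensen's inequality first with respect to the expectation $\mathbb{E}_1$ and then with respect to $\mathbb{E}_2$, we get

\begin{align*}
Z &\ge \frac{Z_1 Z_2}{2} \mathbb{E}_2\Big[\sqrt{(\mathbb{E}_1[A_1(Y_1)])^2 + (A_2(Y_2))^2 - 4}\Big] \\
&\ge \frac{Z_1 Z_2}{2} \sqrt{(\mathbb{E}_1[A_1(Y_1)])^2 + (\mathbb{E}_2[A_2(Y_2)])^2 - 4}.
\end{align*}

The claim follows by substituting $\mathbb{E}_i[A_i(Y_i)] = \frac{2}{Z_i}$. ■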